[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-28 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171217#comment-15171217
 ] 

Luis Filipe Nassif commented on TIKA-1824:
--

Well, a PDF can also be an attachment, and office documents can sit inside a zip 
file, yet PDF and zip each have their own module. So I think it is OK to create 
an email module.

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial break down of parser modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-27 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170711#comment-15170711
 ] 

Luis Filipe Nassif commented on TIKA-1824:
--

Great job [~bobpaulin]! I suggest putting MboxParser, OutlookPSTParser and 
RFC822Parser in a separate tika-mail-parser module. OutlookPSTParser depends on 
java-libpst, not on POI. MboxParser depends on RFC822Parser. Unfortunately, 
Outlook MSG parsing depends on POI and should stay in the tika-office-parser 
module.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-05 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133821#comment-15133821
 ] 

Konstantin Gribov commented on TIKA-1824:
-

I'm on vacation now, so I have reviewed this topic only briefly. Great work (y).
I will take a look at the 2.x branch after I return. Do the mbox, outlook and 
rfc822 parsers go into one module?

My +1 to prefixing `artifactId`s with `tika-parser(s)-` or at least `tika-`. I 
personally prefer `tika-parsers-`, which is an expressive and meaningful name for 
an artifact and makes Tika simpler to use for downstream developers.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132503#comment-15132503
 ] 

Tim Allison commented on TIKA-1824:
---

bq.  Thanks so much for the feedback, these are great things to be discussing.

Yes, yes, indeed.  Thank you, [~kkrugler], [~rgauss], and of course 
[~bobpaulin]!

Consensus for now...keep as is?  Sounds good to me.

bq. so I was considering creating projects with a bundle suffix that would 
embed the dependencies individually as tika-bundle did...

Interesting.  So, OSGi aside for the following (sorry), for those with, um, 
challenged development environments (i.e. medical/financial fields where you 
might only be allowed to bring in publicly released jars), users who only 
wanted to parse pdfs, say, could then grab tika-core.jar, the tika-batch.jar, 
the orig-tika-app.jar and the tika-parser-pdf-bundle.jar and be able to parse 
pdfs?  That would be awesome from the standpoint of several use cases I've 
seen.  Did I get this right?  What do others think?





[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132507#comment-15132507
 ] 

Tim Allison commented on TIKA-1824:
---

Sorry, [~grossws], [~thaichat04] and [~lfcnassif], I should have included you in 
the above! :)



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130386#comment-15130386
 ] 

Ray Gauss II commented on TIKA-1824:


bq. Thank you, Bob Paulin! Again, this is fantastic.

Indeed, thanks!

bq. Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module

Now that the change is in there it seems a bit redundant to have parser and 
module in every artifact ID.  {{tika-parser-*}} already follows 
least-to-most-specific ordering, so perhaps we could just remove module?

I had some concerns over the apparent duplication of dependencies / versions 
but it looks like that will be addressed in TIKA-1847.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131500#comment-15131500
 ] 

Tim Allison commented on TIKA-1824:
---

bq. Perhaps add "parser(s?)" to the artifactId

Y, sorry, [~bobpaulin], now that I see it, I'm changing my mind...

Should we get rid of "tika-parser-" entirely, e.g.:

* advanced-module
* cad-module

or perhaps:

* advanced-parsers
* cad-parsers



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131650#comment-15131650
 ] 

Bob Paulin commented on TIKA-1824:
--

So before we go that way, let me explain what about your previous suggestion 
made me change my mind.  Consider the developer looking at this in a lib 
directory or an IDE.  If they just see advanced-parser or cad-parser, I feel 
we're opening the door to confusion, as there are many other libraries that do 
parsing.  Though it's redundant for the maintainers to have tika-parser-* as a 
prefix, it could ease the life of the end-user developer trying to sort out JAR 
hell on their classpath.

Second, I have the module suffix because I'm still mulling how to replace 
tika-bundle.  Currently there are still many Tika dependencies that are not 
OSGi friendly.  We've been getting around this by embedding them in 
tika-bundle.  The module suffix jars do not have dependencies embedded, so I was 
considering creating projects with a bundle suffix that would embed the 
dependencies individually as tika-bundle did.  I'm curious what the rest of the 
community thinks of this approach. Naturally, if we figure out a way to 
eliminate the need for two kinds of artifact, then I agree the module suffix is 
redundant and can be removed.  My 2 cents.  Thanks so much for the feedback, 
these are great things to be discussing.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131749#comment-15131749
 ] 

Ken Krugler commented on TIKA-1824:
---

As someone who regularly deals with 100s of jars in the dependency tree, I'm a 
big +1 for having "tika-" as a prefix for every jar.

I'm less concerned about tika-cad-parsers vs. tika-parsers-cad (as an example), 
with a mild preference for the former.

I'd rather not have the module suffix, mostly because I haven't been paying any 
attention to the OSGi issues, nor do I have a use case for that yet, and thus 
it doesn't add value for me personally. But that's a very weak -1, given my 
lack of background in this space.
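
The point about a common prefix can be illustrated with a small sketch. It 
only sorts file names the way a lib directory listing would; the jar names in 
main are hypothetical examples (no version suffixes, not a real Tika 
distribution), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class JarNameSorting {

    /** Returns a sorted copy of the given jar file names (natural String order). */
    static List<String> sorted(List<String> jarNames) {
        List<String> copy = new ArrayList<>(jarNames);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical lib directory: two prefixed Tika parser jars, two
        // unprefixed alternatives, and unrelated third-party jars.
        List<String> libDir = Arrays.asList(
                "pdfbox.jar",
                "tika-parser-pdf-module.jar",
                "cad-parsers.jar",
                "commons-codec.jar",
                "tika-core.jar",
                "advanced-parsers.jar",
                "tika-parser-cad-module.jar");

        // After sorting, every name starting with "tika-" is adjacent, while
        // the unprefixed names end up scattered among the other libraries.
        for (String name : sorted(libDir)) {
            System.out.println(name);
        }
    }
}
```

With 100s of jars the effect is the same: anything sharing the "tika-" prefix 
stays in one contiguous run of the listing, whatever comes after the prefix.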




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106752#comment-15106752
 ] 

Tim Allison commented on TIKA-1824:
---

Thank you, [~bobpaulin]!  Again, this is fantastic.  I should have a chance to 
take a look later today.  [~chrismattmann], [~gagravarr], [~kkrugler], 
[~lewismc], [~rgauss] or others, any feedback on this massive refactoring?



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103913#comment-15103913
 ] 

Hudson commented on TIKA-1824:
--

SUCCESS: Integrated in tika-2.x #13 (See 
[https://builds.apache.org/job/tika-2.x/13/])
TIKA-1824 - Add CTakes resource to scientific module (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725119])
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser/ctakes
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser/ctakes/CTAKESConfig.properties
TIKA-1824 - Remove CTakes resource from web module (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725118])
* trunk/tika-parser-modules/tika-parser-web-module/src/main/resources/org




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103490#comment-15103490
 ] 

Hudson commented on TIKA-1824:
--

SUCCESS: Integrated in tika-2.x #11 (See 
[https://builds.apache.org/job/tika-2.x/11/])
TIKA-1824 - Lowercase parent parser module (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725045])
* trunk/tika-parser-modules/pom.xml




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103450#comment-15103450
 ] 

Hudson commented on TIKA-1824:
--

SUCCESS: Integrated in tika-2.x #10 (See 
[https://builds.apache.org/job/tika-2.x/10/])
TIKA-1824 - Moved parser text to before module name. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725033])
* trunk/tika-parser-bundles/tika-multimedia-bundle/pom.xml
* trunk/tika-parser-modules/pom.xml
* trunk/tika-parser-modules/tika-advanced-parser-module
* trunk/tika-parser-modules/tika-cad-parser-module
* trunk/tika-parser-modules/tika-code-parser-module
* trunk/tika-parser-modules/tika-database-parser-module
* trunk/tika-parser-modules/tika-ebook-parser-module
* trunk/tika-parser-modules/tika-journal-parser-module
* trunk/tika-parser-modules/tika-multimedia-parser-module
* trunk/tika-parser-modules/tika-office-parser-module
* trunk/tika-parser-modules/tika-package-parser-module
* trunk/tika-parser-modules/tika-parser-advanced-module
* trunk/tika-parser-modules/tika-parser-advanced-module/pom.xml
* trunk/tika-parser-modules/tika-parser-advanced-module/src
* trunk/tika-parser-modules/tika-parser-cad-module
* trunk/tika-parser-modules/tika-parser-cad-module/pom.xml
* trunk/tika-parser-modules/tika-parser-cad-module/src
* trunk/tika-parser-modules/tika-parser-code-module
* trunk/tika-parser-modules/tika-parser-code-module/pom.xml
* trunk/tika-parser-modules/tika-parser-code-module/src
* trunk/tika-parser-modules/tika-parser-database-module
* trunk/tika-parser-modules/tika-parser-database-module/pom.xml
* trunk/tika-parser-modules/tika-parser-database-module/src
* trunk/tika-parser-modules/tika-parser-ebook-module
* trunk/tika-parser-modules/tika-parser-ebook-module/pom.xml
* trunk/tika-parser-modules/tika-parser-ebook-module/src
* trunk/tika-parser-modules/tika-parser-journal-module
* trunk/tika-parser-modules/tika-parser-journal-module/pom.xml
* trunk/tika-parser-modules/tika-parser-journal-module/src
* trunk/tika-parser-modules/tika-parser-multimedia-module
* trunk/tika-parser-modules/tika-parser-multimedia-module/pom.xml
* trunk/tika-parser-modules/tika-parser-office-module
* trunk/tika-parser-modules/tika-parser-office-module/pom.xml
* trunk/tika-parser-modules/tika-parser-office-module/src
* trunk/tika-parser-modules/tika-parser-package-module
* trunk/tika-parser-modules/tika-parser-package-module/pom.xml
* trunk/tika-parser-modules/tika-parser-package-module/src
* trunk/tika-parser-modules/tika-parser-pdf-module
* trunk/tika-parser-modules/tika-parser-pdf-module/pom.xml
* trunk/tika-parser-modules/tika-parser-pdf-module/src
* trunk/tika-parser-modules/tika-parser-scientific-module
* trunk/tika-parser-modules/tika-parser-scientific-module/pom.xml
* trunk/tika-parser-modules/tika-parser-scientific-module/src
* trunk/tika-parser-modules/tika-parser-text-module
* trunk/tika-parser-modules/tika-parser-text-module/pom.xml
* trunk/tika-parser-modules/tika-parser-text-module/src
* trunk/tika-parser-modules/tika-parser-web-module
* trunk/tika-parser-modules/tika-parser-web-module/pom.xml
* trunk/tika-parser-modules/tika-pdf-parser-module
* trunk/tika-parser-modules/tika-scientific-parser-module
* trunk/tika-parser-modules/tika-text-parser-module
* trunk/tika-parser-modules/tika-web-parser-module
* trunk/tika-parsers/pom.xml




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103416#comment-15103416
 ] 

Hudson commented on TIKA-1824:
--

SUCCESS: Integrated in tika-2.x #9 (See 
[https://builds.apache.org/job/tika-2.x/9/])
TIKA-1824 - Added SVN ignores (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725015])
* trunk/tika-parser-modules/tika-advanced-parser-module
* trunk/tika-parser-modules/tika-cad-parser-module
* trunk/tika-parser-modules/tika-code-parser-module
* trunk/tika-parser-modules/tika-database-parser-module
* trunk/tika-parser-modules/tika-ebook-parser-module
* trunk/tika-parser-modules/tika-journal-parser-module
* trunk/tika-parser-modules/tika-office-parser-module
* trunk/tika-parser-modules/tika-package-parser-module
* trunk/tika-parser-modules/tika-pdf-parser-module
* trunk/tika-parser-modules/tika-scientific-parser-module
* trunk/tika-parser-modules/tika-text-parser-module
TIKA-1824 - Big Renaming.  Adding parsers to the artifact names and 
descriptions. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725014])
* trunk/tika-parser-bundles/tika-multimedia-bundle/pom.xml
* trunk/tika-parser-modules/pom.xml
* trunk/tika-parser-modules/tika-advanced-module
* trunk/tika-parser-modules/tika-advanced-parser-module
* trunk/tika-parser-modules/tika-advanced-parser-module/pom.xml
* trunk/tika-parser-modules/tika-advanced-parser-module/src
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/crypto
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/NERecogniser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/corenlp
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNERecogniser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/regex
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/regex/RegexNERecogniser.java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF/services
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner/regex
* trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner/regex/ner-regex.txt
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika/parser
* trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika/parser/crypto
* 

[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-14 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097884#comment-15097884
 ] 

Nick Burch commented on TIKA-1824:
--

Tika already supports using a custom classloader for loading parser + detector 
classes + spi files - 
http://tika.apache.org/1.11/api/org/apache/tika/config/TikaConfig.html#TikaConfig%28java.lang.ClassLoader%29
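
A minimal sketch of how such a classloader might be built, assuming the parser 
jars and their dependencies sit together in one directory. The directory name 
"tika-parser-lib" and the class and method names below are hypothetical; only 
JDK classes are used so the sketch compiles on its own, and the call to the 
TikaConfig constructor linked above is shown as a comment because it needs 
tika-core on the classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ParserClassLoaders {

    /** Collects the URLs of all *.jar files directly inside the given directory. */
    static URL[] jarUrls(Path dir) throws IOException {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return urls.toArray(new URL[0]);
    }

    /** Creates a classloader that sees the jars in dir and delegates to parent first. */
    static URLClassLoader forDirectory(Path dir, ClassLoader parent) throws IOException {
        return new URLClassLoader(jarUrls(dir), parent);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location of the parser jars; pass a real path as the first argument.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "tika-parser-lib");
        if (!Files.isDirectory(libDir)) {
            System.out.println("No such directory: " + libDir);
            return;
        }
        try (URLClassLoader loader =
                     forDirectory(libDir, ParserClassLoaders.class.getClassLoader())) {
            System.out.println("Jars visible to the loader: " + loader.getURLs().length);
            // With tika-core available, the loader would be handed to the
            // constructor from the Javadoc link above:
            // TikaConfig config = new TikaConfig(loader);
        }
    }
}
```

The parent loader should be the one that already sees tika-core, so that the 
Parser interface is shared between the application and the loaded parsers.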



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096668#comment-15096668
 ] 

Uwe Schindler commented on TIKA-1824:
-

Hi, as invited on TIKA-1830, here are some comments from Apache Solr:

{quote}
As already stated in the past, we would like to bundle only parsers for text 
document formats, because images, class files and the like are not really 
useful for indexing by default. Users who want them can still add the missing 
parser bundles and SPI will do the rest. Currently we have disabled some 
parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so Tika's SPI 
disables them automatically (because of ClassNotFoundEx). This was a bit 
crude, but it worked.

The reason for this was partly also some version incompatibilities (ASM was old 
in Tika, Lucene needs the newest one), but ASM is not really useful for indexing 
anyway!

In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR 
file whether it gets bundled, and we check every release anyway during updates.
{quote}

In addition, it would be a good idea to allow loading the Tika SPI files in a 
separate classloader (to isolate the parser classes from others). The reason 
for this is JAR hell. If Tika loaded the parsers in its own classloader 
(optionally, e.g. by configuration), we could place all parsers and their 
dependencies in a separate lib directory outside Solr's lib folder.
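
The behaviour described in the quoted note (a parser drops out when the JAR it 
needs is missing) can be sketched roughly as below. This is not Tika's actual 
service loading code, only an illustration of the idea with JDK classes; the 
class name TolerantProviderLoading, the method loadAvailable and the provider 
name org.example.MissingParser are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TolerantProviderLoading {

    /**
     * Tries to load each listed provider class through the given classloader
     * and returns only those that could be loaded. Providers whose class, or a
     * class they link against, is not on the classpath are silently skipped.
     */
    static List<Class<?>> loadAvailable(List<String> classNames, ClassLoader loader) {
        List<Class<?>> available = new ArrayList<>();
        for (String name : classNames) {
            try {
                available.add(Class.forName(name, true, loader));
            } catch (ClassNotFoundException | LinkageError e) {
                // Missing class or missing dependency: treat the provider as disabled.
            }
        }
        return available;
    }

    public static void main(String[] args) {
        // Stand-ins for the entries of a META-INF/services file: one class that
        // exists in every JDK and one hypothetical class that does not exist.
        List<String> listed = Arrays.asList(
                "java.util.ArrayList",
                "org.example.MissingParser");

        List<Class<?>> usable =
                loadAvailable(listed, TolerantProviderLoading.class.getClassLoader());
        System.out.println("Listed: " + listed.size() + ", usable: " + usable.size());
    }
}
```

Passing a dedicated classloader as the second argument, rather than the 
application's own, is what would keep the parser classes and their 
dependencies apart from the rest of the application.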



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090903#comment-15090903
 ] 

Hudson commented on TIKA-1824:
--

SUCCESS: Integrated in tika-2.x #6 (See 
[https://builds.apache.org/job/tika-2.x/6/])
TIKA-1824 - Fixed incorrect path prefix condition. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723903])
* trunk/tika-test-resources/src/test/resources/org/apache/tika/parser/ner/opennlp/ModelGetter.groovy




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-08 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090285#comment-15090285
 ] 

Bob Paulin commented on TIKA-1824:
--

* Perhaps rename artifact names in parser sub-components to include 
"Parser(s?)", e.g. Apache Tika Parser Advanced Module so that the names sort 
more clearly (at least in the maven window in Intellij)?

I think I felt it was redundant but in a maven repo it could be helpful so I 
can make that change.

* Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module

Same as above.

* Perhaps lowercase names in parser-subcomponents so that they're in line with 
legacy: "Apache Tika parser advanced module"

I think I'm missing where this convention is coming from.

* Pkcs7Parser ... should that be under advanced...or somewhere else ...own 
crypto package?

So I don't feel strongly that it needs to be under advanced, but I do want to be 
careful not to overdo the number of modules.  Do you feel crypto has room for 
growth, or is this just going to be a one-parser project forever?

* iwork ...should we move that to office?

I think it could fit there too.  No issues moving.

* tika-test-resources...should we move TikaTest into that and change the name 
to tika-test? I have a vague memory of wanting to carve out a separate test 
package earlier and adding TikaTest and something else...

I think it could work in tika-core or tika-test.  I don't think I feel strongly 
either way.

* OutlookPSTParser...move that to office?

I'd like to keep this class with all the other mbox classes.  Maybe move mbox to 
office?

* Does MBox belong in web? Not sure where to put it?

Move to office?

* Move CommonsDigester to core if we're willing to add a dependency on 
commons-codec into core?

I'm fine with this.

* Move Activator to tika-bundle?

I believe tika-bundle already has an activator.  Could just remove this.

* Move pot to multimedia or add tika-parsers-multimedia-advanced-module?

Not sure I understand POT in multimedia.  Can you elaborate?

* Move geo.topic to "advanced"...perhaps we rename "advanced" to ner?

Is ner only applied to geo?  My understanding of this domain is limited.

* Move ctakes to "advanced/ner"?

Again my understanding of the domain is limited on what ctakes fits with.


* Collapse web and text?

Not sure I like that since a number of modules depend on text but not web.  
Seems like we'd be adding a lot of needless dependencies.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090355#comment-15090355
 ] 

Hudson commented on TIKA-1824:
--

UNSTABLE: Integrated in tika-2.x #5 (See 
[https://builds.apache.org/job/tika-2.x/5/])
TIKA-1824 - Move mbox to office. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723820])
* trunk/tika-parser-modules/tika-office-module/pom.xml
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java
* trunk/tika-parser-modules/tika-web-module/pom.xml
* trunk/tika-parser-modules/tika-web-module/src/main/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-web-module/src/test/java/org/apache/tika/parser/mbox




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087494#comment-15087494
 ] 

Hudson commented on TIKA-1824:
--

UNSTABLE: Integrated in tika-2.x #4 (See 
[https://builds.apache.org/job/tika-2.x/4/])
TIKA-1824 - Adding parent path to tika-test-resources (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723536])
* trunk/tika-test-resources/pom.xml




[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085946#comment-15085946
 ] 

Tim Allison commented on TIKA-1824:
---

[~bobpaulin], this is an awesome step forward.  Must have been a fair amount of 
work. Thank you!

A few questions...not just for you, but for all.  I'm happy to submit/commit 
patches, but I want to make sure I don't do anything objectionable to the 
community.

* This is probably user error, but I'm getting: [ERROR] Failed to execute 
goal org.apache.maven.plugins:maven-dependency-plugin:2.10:unpack (unpack) on 
project tika-text-module: Unable to find artifact. Could not find artifact 
org.apache.tika:tika-test-resources:jar:tests:2.0-SNAPSHOT in apache.snapshots 
(http://repository.apache.org/snapshots)
* Perhaps rename artifact names in parser sub-components to include 
"Parser(s?)", e.g. Apache Tika Parser Advanced Module, so that the names sort 
more clearly (at least in the Maven window in IntelliJ)?
* Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module?
* Perhaps lowercase names in parser sub-components so that they're in line with 
legacy: "Apache Tika parser advanced module"?
* Pkcs7Parser ... should that be under advanced...or somewhere else ...own 
crypto package?
* iwork ...should we move that to office?
* tika-test-resources...should we move TikaTest into that and change the name 
to tika-test?  I have a vague memory of wanting to carve out a separate test 
package earlier and adding TikaTest and something else...
* OutlookPSTParser...move that to office?  
* Does MBox belong in web?  Not sure where to put it?
* Move CommonsDigester to core _if_ we're willing to add a dependency on 
commons-digest into core?
* Move Activator to tika-bundle?
* Move pot to multimedia or add tika-parsers-multimedia-advanced-module?
* Move geo.topic to "advanced"...perhaps we rename "advanced" to ner?
* Move ctakes to "advanced/ner"?
* Collapse web and text?

Again, this is fantastic.  Thank you!
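
As a rough illustration of the naming suggestions in the list above, a module 
pom might carry coordinates like the following.  This is only a sketch: the 
artifactId is the example given in the comment, while the display name for the 
CAD module is a hypothetical one extrapolated from the "Apache Tika Parser 
Advanced Module" pattern, not a name taken from the actual build.

```xml
<!-- Sketch only: fragment of a parser sub-module pom.xml -->
<!-- artifactId follows the "tika-parser-cad-module" example from the comment -->
<artifactId>tika-parser-cad-module</artifactId>
<!-- Hypothetical display name; including "Parser" keeps the parser modules
     sorted together in IDE project views -->
<name>Apache Tika Parser CAD Module</name>
```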








[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086784#comment-15086784
 ] 

Hudson commented on TIKA-1824:
--

FAILURE: Integrated in tika-2.x #3 (See 
[https://builds.apache.org/job/tika-2.x/3/])
TIKA-1824 - Disable Dependency Reduced POM in tika-parsers.  This is causing 
dependencies not to get pulled into tika-app. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723453])
* trunk/tika-parsers/pom.xml
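
For readers unfamiliar with the setting, a dependency-reduced POM is typically 
produced by the maven-shade-plugin, and it can be switched off with the 
plugin's createDependencyReducedPom parameter.  The fragment below is a minimal 
sketch under the assumption that tika-parsers uses maven-shade-plugin here; it 
is not copied from the actual commit.

```xml
<!-- Sketch only: assumes the dependency-reduced POM comes from maven-shade-plugin -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- Keep the original POM so downstream modules (e.g. tika-app) still
         see the module's declared dependencies -->
    <createDependencyReducedPom>false</createDependencyReducedPom>
  </configuration>
</plugin>
```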







[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086735#comment-15086735
 ] 

Hudson commented on TIKA-1824:
--

FAILURE: Integrated in tika-2.x #2 (See 
[https://builds.apache.org/job/tika-2.x/2/])
TIKA-1824 - Added tika-test-resources to module list so it is built. (bob: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723446])
* trunk/pom.xml







[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-06 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086170#comment-15086170
 ] 

Bob Paulin commented on TIKA-1824:
--

I'm in a bit of a rush today, but the answer to bullet one is that you need to 
build the tika-test-resources project before anything else.  I think we should 
add tika-test-resources as a module to a parent pom so this happens 
automatically.  Otherwise I'm sure many will hit this issue!
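
A minimal sketch of what that could look like, assuming the standard Maven 
modules element in an aggregating pom.  Only tika-test-resources, 
tika-parser-modules and tika-parsers are names that appear in this thread; 
which other modules the real pom lists, and in what order, is not shown here.

```xml
<!-- Sketch only: fragment of an aggregating (parent) pom.xml -->
<modules>
  <!-- Listing the module puts it in the reactor, so a top-level build
       produces the test-resources artifact that other modules unpack -->
  <module>tika-test-resources</module>
  <module>tika-parser-modules</module>
  <module>tika-parsers</module>
</modules>
```

With the module in the reactor, Maven can order the build from the declared 
inter-module dependencies, so a separate manual install of tika-test-resources 
should no longer be needed.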



