Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On Sun, 13 Dec 2015, Bob Paulin wrote:

So in short

Source in tika-parser
Dependencies managed in tika-parser and copied to module

Source in Modules
Dependencies managed in modules and consolidated via maven shade plugin. 
Conflicting dependencies managed by maven.


IIRC there are some util / parent classes in the tika parsers module which 
many different parsers need. Where would those end up?


Thanks
Nick


Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Bob Paulin

Answers inline

On 12/14/2015 5:24 AM, Nick Burch wrote:

On Sun, 13 Dec 2015, Bob Paulin wrote:

So in short

Source in tika-parser
Dependencies managed in tika-parser and copied to module

Source in Modules
Dependencies managed in modules and consolidated via maven shade 
plugin. Conflicting dependencies managed by maven.


IIRC there are some util / parent classes in the tika parsers module 
which many different parsers need. Where would those end up?
Good question.  This only applies if the sources are moved to the
modules.  Parent classes that apply only to specific parsers would move
into the modules supporting those parsers.  However, there are broader
examples where this would not make sense.  I think one example is
org.apache.tika.parser.utils.CommonsDigester.  Could classes like this
be moved into tika-core?  Another option would be to form a
tika-parser-util module, but there don't seem to be many classes that
would fall under it.

Thanks
Nick


Thanks,
- Bob


RE: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Allison, Timothy B.
>> example is the org.apache.tika.parser.utils.CommonsDigester.  Could classes 
>> like this be moved into tika-core? 
Yes, I was not happy with the split I did with that, but I wanted to avoid
adding a dependency on commons-codec into core.  What do others think about
adding another 180k to the core jar?
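
[Editor's sketch] The kind of split described above can be pictured as an
interface that carries no third-party imports (so it could sit in core)
plus an implementation that lives in whichever module owns the extra
dependency.  The names StreamDigester and JdkDigester are hypothetical and
are not Tika's actual classes; the implementation uses only the JDK so the
example runs on its own, whereas the class discussed in the thread relies
on commons-codec.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of "interface in core, implementation elsewhere".
final class DigesterSplitSketch {

    // Candidate for the core module: depends on nothing outside the JDK.
    interface StreamDigester {
        String digestHex(InputStream in) throws IOException;
    }

    // Candidate for a parser-side module. A real implementation might use
    // commons-codec here; this one uses java.security to stay self-contained.
    static final class JdkDigester implements StreamDigester {
        private final String algorithm;

        JdkDigester(String algorithm) {
            this.algorithm = algorithm;
        }

        @Override
        public String digestHex(InputStream in) throws IOException {
            MessageDigest md;
            try {
                md = MessageDigest.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException("Unknown algorithm: " + algorithm, e);
            }
            // Stream the input through the digest in fixed-size chunks.
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            // Render each byte as two lowercase hex digits.
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
    }

    // Convenience wrapper for callers that only hold a string.
    static String hexDigest(String algorithm, String text) {
        StreamDigester digester = new JdkDigester(algorithm);
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return digester.digestHex(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hexDigest("MD5", "abc"));
    }
}
```

Callers compile against the interface only, so the question of which jar
carries the codec dependency is confined to the implementing module.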


 


RE: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Ken Krugler

> From: Bob Paulin
> Sent: December 13, 2015 7:34:03pm PST
> To: dev@tika.apache.org
> Subject: Tika 2.0 Source in Modules or tika-parser
> 
> Hi,
> 
> I've committed the first module breakout to the Tika 2.0 branch and I'd like
> to discuss the possibility of moving the source code from the tika-parser
> project to the modules.  The implementation I committed is based on the
> straw-man version I proposed a few months ago, which copies the class files
> to the modules.  The dependencies are managed in the tika-parser project and
> also copied and embedded into the individual modules.  If the source were
> moved to the modules, they would have their own dependency management.  They
> could then be combined into a single jar (like the current tika-parser jar)
> with the maven shade plugin.  Any conflicting versions in two separate
> modules would be resolved in tika-parser via maven.
> 
> So in short
> 
> Source in tika-parser
> Dependencies managed in tika-parser and copied to module
> 
> Source in Modules
> Dependencies managed in modules and consolidated via maven shade plugin.   
> Conflicting dependencies managed by maven.
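
[Editor's sketch] "Conflicting dependencies managed by maven" presumably
refers to Maven's dependency mediation, where the definition nearest to the
root of the tree wins and, at equal depth, the first declaration wins.  The
toy model below illustrates that rule only; it is not Maven's resolver, and
module-a, module-b, shared-lib, and all version numbers are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of "nearest definition wins" mediation (hypothetical artifacts).
final class NearestWinsSketch {

    // One node of a dependency tree: an artifact, its version, its own dependencies.
    static final class Dep {
        final String artifact;
        final String version;
        final List<Dep> children;

        Dep(String artifact, String version, Dep... children) {
            this.artifact = artifact;
            this.version = version;
            this.children = Arrays.asList(children);
        }
    }

    // Breadth-first walk from the root's direct dependencies. The first version
    // seen for an artifact is the shallowest one and, at equal depth, the one
    // declared first. A losing duplicate is dropped together with its subtree.
    static Map<String, String> resolve(Dep root) {
        Map<String, String> chosen = new LinkedHashMap<>();
        Deque<Dep> queue = new ArrayDeque<>(root.children);
        while (!queue.isEmpty()) {
            Dep dep = queue.removeFirst();
            if (chosen.putIfAbsent(dep.artifact, dep.version) == null) {
                queue.addAll(dep.children);
            }
        }
        return chosen;
    }

    // Two modules disagree on shared-lib; the aggregate may pin a version itself.
    static Dep aggregate(String pinnedVersion) {
        Dep moduleA = new Dep("module-a", "2.0", new Dep("shared-lib", "1.0"));
        Dep moduleB = new Dep("module-b", "2.0", new Dep("shared-lib", "2.0"));
        if (pinnedVersion == null) {
            return new Dep("tika-parser", "2.0", moduleA, moduleB);
        }
        return new Dep("tika-parser", "2.0", moduleA, moduleB,
                new Dep("shared-lib", pinnedVersion));
    }

    public static void main(String[] args) {
        System.out.println(resolve(aggregate(null)).get("shared-lib"));
        System.out.println(resolve(aggregate("2.0")).get("shared-lib"));
    }
}
```

Without a pin, module-a's version is picked because it is declared first at
the same depth; a direct declaration in the aggregate sits nearer the root
and therefore overrides both modules.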

I don't have any experience with moving classes around to create modules, so my 
natural inclination is to move the sources.

As far as shared code, I think moving something like commons-codec into core 
(100K) is fine.

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Ray Gauss
I'd vote for a tika-parser-common(s) artifact for common util classes and 
dependencies.





Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On 14/12/15 16:26, Ray Gauss wrote:

I'd vote for a tika-parser-common(s) artifact for common util classes and 
dependencies.


That would make sense to me

Nick


Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Bob Paulin
So there seems to be a pretty good consensus forming around moving the
sources but some differing opinions on where to put shared parser code.

tika-parser-commons
Pros: We would be able to keep from adding another dependency to the
tika-core project.
Cons:  All parsers would then require an additional dependency on the
tika-parser-commons artifact.

tika-core
Pros: Less hierarchy since all parsers would still just depend on tika-core
Cons: Opens the door for tika-core to grow in source and external
dependencies.

Let me know if I've missed anything.  I think I'd lean toward putting the
code in tika-core to limit the hierarchy, but I could be swayed if there's
strong evidence that tika-core's dependencies would start ballooning.

- Bob
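
[Editor's sketch] The trade-off between the two layouts can be made
concrete by computing what each artifact pulls in transitively.  The graphs
below are assumptions for illustration: example-parser-module is a
hypothetical parser module, and commons-codec stands in for whatever the
shared utility classes end up needing.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Compares the transitive footprint of the two proposed layouts (assumed graphs).
final class LayoutFootprintSketch {

    // Everything reachable from `start`, excluding `start` itself, sorted by name.
    static Set<String> closure(Map<String, List<String>> graph, String start) {
        Set<String> seen = new TreeSet<>();
        Deque<String> pending =
                new ArrayDeque<>(graph.getOrDefault(start, Collections.<String>emptyList()));
        while (!pending.isEmpty()) {
            String next = pending.pop();
            if (seen.add(next)) {
                pending.addAll(graph.getOrDefault(next, Collections.<String>emptyList()));
            }
        }
        return seen;
    }

    // Shared code in a separate tika-parser-commons artifact.
    static Map<String, List<String>> commonsLayout() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("tika-core", Collections.<String>emptyList());
        g.put("tika-parser-commons", Arrays.asList("tika-core", "commons-codec"));
        g.put("example-parser-module", Arrays.asList("tika-core", "tika-parser-commons"));
        return g;
    }

    // Shared code folded into tika-core.
    static Map<String, List<String>> coreLayout() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("tika-core", Arrays.asList("commons-codec"));
        g.put("example-parser-module", Arrays.asList("tika-core"));
        return g;
    }

    public static void main(String[] args) {
        System.out.println("commons layout, core:   " + closure(commonsLayout(), "tika-core"));
        System.out.println("commons layout, parser: " + closure(commonsLayout(), "example-parser-module"));
        System.out.println("core layout, core:      " + closure(coreLayout(), "tika-core"));
        System.out.println("core layout, parser:    " + closure(coreLayout(), "example-parser-module"));
    }
}
```

In this model a core-only consumer pays for commons-codec only under the
tika-core layout, while every parser module carries one extra artifact
under the tika-parser-commons layout, which matches the pros and cons
listed above.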



Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On Mon, 14 Dec 2015, Bob Paulin wrote:
So there seems to be a pretty good consensus forming around moving the 
sources but some differing opinions on where to put shared parser code.


I know it'll be a bit dull and some work, but... Could someone put 
together a list (probably in the wiki or on jira so we can edit it) of the 
candidate classes to go in core/commons, along with their dependencies?


Once we've finalised that list, the answer may become clear just from 
that!


Nick


Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Bob Paulin
Created https://issues.apache.org/jira/browse/TIKA-1812

I also included the output from jdeps, which shows a package-by-package
breakdown of dependencies.  Is org.apache.tika.parser.utils the only shared
package, or are there others?  We can probably move this discussion to the
JIRA.

- Bob
