Re: Tika 2.0 Source in Modules or tika-parser
On Sun, 13 Dec 2015, Bob Paulin wrote:
> So in short
>
> Source in tika-parser
> Dependencies managed in tika-parser and copied to module
>
> Source in Modules
> Dependencies managed in modules and consolidated via maven shade plugin.
> Conflicting dependencies managed by maven.

IIRC there are some util / parent classes in the tika parsers module which many different parsers need. Where would those end up?

Thanks
Nick
Re: Tika 2.0 Source in Modules or tika-parser
Answers inline

On 12/14/2015 5:24 AM, Nick Burch wrote:
> IIRC there are some util / parent classes in the tika parsers module
> which many different parsers need. Where would those end up?

Good question. This would only apply if the sources were moved to the modules. If the parent classes only applied to specific parsers, they would move into the modules supporting those parsers. However, there are broader examples where this would not make sense. I think one example is the org.apache.tika.parser.utils.CommonsDigester. Could classes like this be moved into tika-core? Another option could be forming a tika-parser-util module, but there don't seem to be many classes that would fall under that module.

Thanks,
- Bob
RE: Tika 2.0 Source in Modules or tika-parser
On Monday, December 14, 2015 9:16 AM, Bob Paulin wrote:
>> I think one example is the org.apache.tika.parser.utils.CommonsDigester. Could classes
>> like this be moved into tika-core?

Yes, I was not happy with the split I did with that, but I wanted to avoid adding a dependency on commons-codec into core. What do others think... another 180k into the core jar?
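For reference on the commons-codec trade-off raised above, here is a minimal sketch of computing a hex digest with nothing but the JDK. It is an illustration only, not Tika's code: the class name JdkDigestSketch and the method hexDigest are hypothetical, and whether a JDK-only approach would cover everything CommonsDigester needs is not established in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: digests a stream using only JDK classes,
// so the module holding it would need no commons-codec dependency.
final class JdkDigestSketch {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Reads the stream to the end and returns the lowercase hex digest.
    // The caller remains responsible for closing the stream.
    static String hexDigest(String algorithm, InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            return toHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Manual hex encoding; this is the part a codec library would otherwise supply.
    private static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0F];
        }
        return new String(out);
    }
}
```

A caller would pass a standard algorithm name such as "MD5" or "SHA-256". The design question in the thread stays the same either way: a helper like this still has to live somewhere, whether in tika-core or in a shared parser artifact.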
RE: Tika 2.0 Source in Modules or tika-parser
> From: Bob Paulin
> Sent: December 13, 2015 7:34:03pm PST
> To: dev@tika.apache.org
> Subject: Tika 2.0 Source in Modules or tika-parser
>
> Hi,
>
> I've committed the first module break out to the tika 2.0 branch and I'd like
> to discuss the possibility of moving the source code from the tika-parser
> projects to the modules. The implementation I committed is based on the
> straw man version I proposed a few months ago which copies the class files to
> the modules. The dependencies are managed in the tika-parser project and
> also copied and embedded into the individual modules. If the source were
> moved, the modules would have their own dependency management. Then they
> could be combined into a single jar (as the current tika-parser jar) with the
> maven shade plugin. Any conflicting versions in 2 separate modules would be
> resolved in the tika-parser via maven.
>
> So in short
>
> Source in tika-parser
> Dependencies managed in tika-parser and copied to module
>
> Source in Modules
> Dependencies managed in modules and consolidated via maven shade plugin.
> Conflicting dependencies managed by maven.

I don't have any experience with moving classes around to create modules, so my natural inclination is to move the sources.

As far as shared code, I think moving something like commons-codec into core (100K) is fine.

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
Re: Tika 2.0 Source in Modules or tika-parser
I'd vote for a tika-parser-common(s) artifact for common util classes and dependencies.

> On Dec 14, 2015, at 10:54 AM, Ken Krugler wrote:
>
> As far as shared code, I think moving something like commons-codec into core
> (100K) is fine.
Re: Tika 2.0 Source in Modules or tika-parser
On 14/12/15 16:26, Ray Gauss wrote:
> I'd vote for a tika-parser-common(s) artifact for common util classes and dependencies.

That would make sense to me

Nick
Re: Tika 2.0 Source in Modules or tika-parser
So there seems to be a pretty good consensus forming around moving the sources but some differing opinions on where to put shared parser code.

tika-parser-commons
  Pros: We would be able to keep from adding another dependency to the tika-core project.
  Cons: All parsers would then require an additional dependency on the tika-parser-commons artifact.

tika-core
  Pros: Less hierarchy, since all parsers would still just depend on tika-core.
  Cons: Opens the door for tika-core to grow in source and external dependencies.

Let me know if I've missed anything. I think I'd be leaning closer to wanting to put the code in tika-core to limit the hierarchy, but I could be swayed if there's strong evidence that tika-core's dependencies would start ballooning.

- Bob

On Mon, Dec 14, 2015 at 10:43 AM, Nick Burch wrote:
> On 14/12/15 16:26, Ray Gauss wrote:
>
>> I'd vote for a tika-parser-common(s) artifact for common util classes and
>> dependencies.
>
> That would make sense to me
>
> Nick
Re: Tika 2.0 Source in Modules or tika-parser
On Mon, 14 Dec 2015, Bob Paulin wrote:
> So there seems to be a pretty good consensus forming around moving the sources
> but some differing opinions on where to put shared parser code.

I know it'll be a bit dull and some work, but... Could someone put together a list (probably in the wiki or on jira so we can edit it) of the candidate classes to go in core/commons, along with their dependencies?

Once we've finalised that list, the answer may become clear just from that!

Nick
Re: Tika 2.0 Source in Modules or tika-parser
Created https://issues.apache.org/jira/browse/TIKA-1812

Also included the output from jdeps, which shows a package-by-package breakdown of dependencies. Is org.apache.tika.parser.utils the only shared package or are there others? We can probably move this discussion to the JIRA.

- Bob

On Mon, Dec 14, 2015 at 12:57 PM, Nick Burch wrote:
> On Mon, 14 Dec 2015, Bob Paulin wrote:
>
>> So there seems to be a pretty good consensus forming around moving the
>> sources but some differing opinions on where to put shared parser code.
>
> I know it'll be a bit dull and some work, but... Could someone put
> together a list (probably in the wiki or on jira so we can edit it) of the
> candidate classes to go in core/commons, along with their dependencies?
>
> Once we've finalised that list, the answer may become clear just from that!
>
> Nick