Re: Subsets of tika parsers redux

2014-12-16 Thread Sergey Beryozkin
Hi Nick, I think I've actually learned a new urban dictionary word 
mentioned in this thread, 'faff' :-).


On 16/12/14 03:34, Nick Burch wrote:

On Mon, 15 Dec 2014, Sergey Beryozkin wrote:

I'm not proposing to split tika-parsers in a way that would affect the
users. tika-parsers would still be there, except that it would
strongly depend on tika-pdf and perhaps, when it is being built, it
can have its dependencies like tika-pdf shaded in/merged in to ensure
complete backward compatibility as far as user expectations of
tika-parsers are concerned.


We still have the additional faff of multiple core modules, which
someone warned about in an earlier thread, and additional work for
developers, and we did try pulling out the pdf parser which didn't work,
and I'm finding having the Vorbis parsers in a different module + repo
to be a faff


I was thinking of introducing a minimal number of extra modules (at
the 'expense' of tika-parsers), those covering the mainstream parsers,
the ones you mentioned earlier, pdf, plus a few others. tika-parsers would
still be effectively the same after build time, with no side-effects for
the tika-parsers users. Perhaps it is difficult to realize in practice...




My plan doesn't involve any of those problems in phase 1 - core +
parsers don't change at all, so if it doesn't work we haven't got to
work hard to undo it, and people not interested aren't affected.

Alternately, if you head back to some of the earlier threads on this,
and can come up with reasons why the objections raised there can be
overruled, we could hack up tika parsers. (I'm trying to come up with a
plan that respects previously raised issues)


Sounds good, thanks

I might experiment a bit later on and create a patch for review, but
I'll take a pause for now.

Cheers, Sergey

Nick




Re: Subsets of tika parsers redux

2014-12-15 Thread Nick Burch

On Mon, 15 Dec 2014, Sergey Beryozkin wrote:

As a first step, I thought we'd still keep the same tika-parser jar, the
only difference would be what dependencies ended up in the bundle. If
the tika-bundle-pdf has no POI jars included in it, then the Microsoft
Office related parsers shouldn't register themselves.

It would mean that the pdf bundle would have the image, microsoft etc
parser code in them, but the parsers wouldn't be registered as their
dependencies wouldn't be there.
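
A rough sketch of how "parsers only register when their dependencies are there" could look, assuming registration is gated on whether a marker class from each dependency can be loaded. The class name ConditionalParserRegistry, the parser labels and the marker class names are all hypothetical and are not taken from Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Tika itself.
final class ConditionalParserRegistry {

    // Returns true if the named class can be loaded by the given class loader.
    static boolean isAvailable(String className, ClassLoader loader) {
        try {
            // initialize=false: only probe for presence, do not run static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Keeps only the parsers whose marker dependency class is present.
    static List<String> registerable(Map<String, String> parserToRequiredClass,
                                     ClassLoader loader) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : parserToRequiredClass.entrySet()) {
            if (isAvailable(entry.getValue(), loader)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new LinkedHashMap<>();
        // Stand-in for a dependency that is on the classpath
        parsers.put("pdf", "java.lang.String");
        // Stand-in for a dependency that was left out of the bundle
        parsers.put("office", "example.missing.Dependency");
        System.out.println(
            registerable(parsers, ConditionalParserRegistry.class.getClassLoader()));
    }
}
```

With this shape the parser code can stay in the jar, and only the presence of the dependency decides what gets registered.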

Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else.
If it proves popular, we can then re-visit the giant tika parsers
question, but if not it shouldn't change anything. Well, that's the
theory... :)



Sorry if I haven't completely understood the idea, I think there's definitely 
something nice being suggested above, and it sounds to me as if the following 
can be one possible realization of it, as a first step for example,
- add a tika-pdf module, this will be a bundle, so it will work as a jar and 
as an OSGI bundle; the code for tika-pdf will be extracted (and removed) from 
tika-parsers


Not quite - I foresee this being OSGi only for now. The Tika Parsers project
would be unchanged; OSGi users could have tika (all) as now, or just
tika-pdf


- tika-parsers will get updated to depend on tika-pdf - hence users working
with tika-parsers won't be affected


No, that's a possible phase 2 if it goes well. No change for non-OSGi
stuff. Non-OSGi users can look at the OSGi build to work out what to include
and exclude if they want. (This means that we have a unit tested way to
see what you do/don't want, without affecting things for the simple Tika
users who are already confused by tika-core + tika-parsers)


- those users who want to work with PDF only would add tika-core + tika-pdf
dependencies only


OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf, or 
tika + tika-parsers-pdf + tika-parsers-mp3 if they want



OSGi is nicely contained, and fairly easy to unit test, so let's use that 
to test out the idea! That also solves the CXF need. Once that works, and 
once we have a tested way that everyone can see + understand, then someone 
can try to make the case for phase II where we push it to the maven pom / 
project level!


Nick


Re: Subsets of tika parsers redux

2014-12-15 Thread Sergey Beryozkin

Hi Nick,
On 15/12/14 14:02, Nick Burch wrote:

On Mon, 15 Dec 2014, Sergey Beryozkin wrote:

As a first step, I thought we'd still keep the same tika-parser jar, the
only difference would be what dependencies ended up in the bundle. If
the tika-bundle-pdf has no POI jars included in it, then the Microsoft
Office related parsers shouldn't register themselves.

It would mean that the pdf bundle would have the image, microsoft etc
parser code in them, but the parsers wouldn't be registered as their
dependencies wouldn't be there.

Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else.
If it proves popular, we can then re-visit the giant tika parsers
question, but if not it shouldn't change anything. Well, that's the
theory... :)



Sorry if I haven't completely understood the idea, I think there's
definitely something nice being suggested above, and it sounds to me
as if the following can be one possible realization of it, as a first
step for example,
- add a tika-pdf module, this will be a bundle, so it will work as a
jar and as an OSGI bundle; the code for tika-pdf will be extracted
(and removed) from tika-parsers


Not quite - I foresee this being OSGi only for now. The Tika Parsers project
would be unchanged; OSGi users could have tika (all) as now, or just
tika-pdf


- tika-parsers will get updated to depend on tika-pdf - hence users
working with tika-parsers won't be affected


No, that's a possible phase 2 if it goes well. No change for non-OSGi
stuff. Non-OSGi users can look at the OSGi build to work out what to include
and exclude if they want. (This means that we have a unit tested way to
see what you do/don't want, without affecting things for the simple Tika
users who are already confused by tika-core + tika-parsers)


- those users who want to work with PDF only would add tika-core +
tika-pdf dependencies only


OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf,
or tika + tika-parsers-pdf + tika-parsers-mp3 if they want


OSGi is nicely contained, and fairly easy to unit test, so let's use
that to test out the idea! That also solves the CXF need. Once that
works, and once we have a tested way that everyone can see + understand,
then someone can try to make the case for phase II where we push it to
the maven pom / project level!

The need of CXF (Tika) users (or of some other users with possibly
similar requirements) is not about shipping OSGi-only Tika modules but
about having an easy option of not having to include all of
tika-parsers. Some CXF users would work with OSGi, some not. Sorry if I
did not clarify it.


As I said, a module marked as a bundle, as opposed to the default 'jar', is
just a plain jar with a few extra META-INF instructions.
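
To illustrate the point, here is a minimal sketch using java.util.jar.Manifest to show what those extra META-INF entries amount to. Bundle-ManifestVersion, Bundle-SymbolicName and Bundle-Version are standard OSGi manifest headers; the class name BundleManifestSketch and the symbolic name org.example.tika.pdf are made up for the example, and a real build would normally generate these headers with tooling instead of writing them by hand.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical sketch: shows that a bundle is a jar whose manifest carries OSGi headers.
final class BundleManifestSketch {

    // Builds a manifest with the headers that mark a jar as an OSGi bundle.
    static Manifest bundleManifest(String symbolicName, String version) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise Manifest.write emits no main attributes
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-SymbolicName", symbolicName);
        main.putValue("Bundle-Version", version);
        return manifest;
    }

    // A plain jar has no Bundle-SymbolicName header; a bundle does.
    static boolean looksLikeBundle(Manifest manifest) {
        return manifest.getMainAttributes().getValue("Bundle-SymbolicName") != null;
    }

    public static void main(String[] args) throws IOException {
        // "org.example.tika.pdf" is an illustrative name, not a real artifact
        Manifest manifest = bundleManifest("org.example.tika.pdf", "1.0.0");
        manifest.write(System.out);
    }
}
```

A non-OSGi consumer simply ignores these headers, which is why the same artifact can serve both kinds of users.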


Given that, I don't understand why you are opposed to having
tika-parsers minimized as I suggested. What exactly is your concern?


Shipping something like tika-pdf but still keeping the PDF parsing code
inside tika-parsers is duplication, right?


Thanks, Sergey






Nick




Re: Subsets of tika parsers redux

2014-12-15 Thread Nick Burch

On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
I'm not proposing to split tika-parsers in a way that would affect the
users. tika-parsers would still be there, except that it would strongly
depend on tika-pdf and perhaps, when it is being built, it can have its
dependencies like tika-pdf shaded in/merged in to ensure complete
backward compatibility as far as user expectations of tika-parsers
are concerned.


We still have the additional faff of multiple core modules, which 
someone warned about in an earlier thread, and additional work for 
developers, and we did try pulling out the pdf parser which didn't work, 
and I'm finding having the Vorbis parsers in a different module + repo to 
be a faff


My plan doesn't involve any of those problems in phase 1 - core + parsers 
don't change at all, so if it doesn't work we haven't got to work hard to 
undo it, and people not interested aren't affected.


Alternately, if you head back to some of the earlier threads on this, and 
can come up with reasons why the objections raised there can be overruled, 
we could hack up tika parsers. (I'm trying to come up with a plan that 
respects previously raised issues)


Nick


Re: Subsets of tika parsers redux

2014-11-24 Thread Sergey Beryozkin

Hi Nick

Was good talking to you and thanks for initiating this thread.

It is an interesting idea, one that can lead to introducing
finer-grained bundles but also to providing a mechanism for the
(auto-)generation of the import metadata required by each of the parser
modules. Besides, introducing several smaller bundles that would group
the most popular formats is a good idea on its own IMHO.


My doubt here is how many of those bundles we'd need to create, and whether
it will make it easy for users to get a task done like "get a parser for
format A only" or "get parsers for formats A and B only".


Are we talking about introducing a parser module per every supported 
format, and having tika-parsers depend on all of those modules, with 
every parser module becoming a bundle (a jar plus an entry in the 
manifest) ?


Thanks, Sergey


On 23/11/14 17:12, Nick Burch wrote:

Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset
of Tika Parsers issue that bubbles up from time to time. It seemed to
work well, and I think we both now have a better idea of the other's
needs and concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere,
we have some users who are confused already by the split between
tika-core and tika-parsers. Anything that fragments further is going to
cause more issues for that kind of user.

On the other hand, there are potential users out there who want just a
handful of parsers, in a simple and easy and small way, who don't know a
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of
those are using OSGi, but not all.

One suggested solution is to just document what dependencies of
tika-parsers can be excluded at the maven level to disable certain
parsers + shrink the resulting dependency tree. However, that requires
manual updates, manual checking, and like our examples on the website
risk getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the
website into svn, with unit tests, and having the website pull those
from svn on the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full
bundle as now, but also have ones for pdf, microsoft office,
images etc. OSGi users (eg CXF users) could then opt to depend on
pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller
bundle. These unit tests also ensure that unwanted parsers don't work,
thus flagging up if extra dependencies have snuck through.

Finally, we pull out the includes/excludes information that went into
the bundle, and display that for non-OSGi users. A non-OSGi person
wanting tika with pdf only could then look at what the tika-pdf-bundle
does and doesn't use, and from that know what maven level dependencies
to keep and which to exclude


This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of common cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone
want to help? :)

Nick




Re: Subsets of tika parsers redux

2014-11-24 Thread Mattmann, Chris A (3980)
Hey Nick,

This sounds like a great plan to me, good job to you
and Sergey. As for helping I'll try my best, but I'm not
an OSGI guru :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nick Burch n...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Sunday, November 23, 2014 at 6:12 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Subsets of tika parsers redux

Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset of
Tika Parsers issue that bubbles up from time to time. It seemed to work
well, and I think we both now have a better idea of the other's needs and
concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere, we
have some users who are confused already by the split between tika-core
and tika-parsers. Anything that fragments further is going to cause more
issues for that kind of user.

On the other hand, there are potential users out there who want just a
handful of parsers, in a simple and easy and small way, who don't know a
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those
are using OSGi, but not all.

One suggested solution is to just document what dependencies of
tika-parsers can be excluded at the maven level to disable certain parsers
+ shrink the resulting dependency tree. However, that requires manual
updates, manual checking, and like our examples on the website risk
getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website
into svn, with unit tests, and having the website pull those from svn on
the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full
bundle as now, but also have ones for pdf, microsoft office, images
etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if
they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller bundle.
These unit tests also ensure that unwanted parsers don't work, thus
flagging up if extra dependencies have snuck through.

Finally, we pull out the includes/excludes information that went into the
bundle, and display that for non-OSGi users. A non-OSGi person wanting
tika with pdf only could then look at what the tika-pdf-bundle does and
doesn't use, and from that know what maven level dependencies to keep and
which to exclude


This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of common cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want
to help? :)

Nick



Subsets of tika parsers redux

2014-11-23 Thread Nick Burch

Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset of 
Tika Parsers issue that bubbles up from time to time. It seemed to work 
well, and I think we both now have a better idea of the other's needs and 
concerns, which is good :)


As is shown on our list from time to time, but more commonly elsewhere, we 
have some users who are confused already by the split between tika-core 
and tika-parsers. Anything that fragments further is going to cause more 
issues for that kind of user.


On the other hand, there are potential users out there who want just a 
handful of parsers, in a simple and easy and small way, who don't know a 
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those 
are using OSGi, but not all.


One suggested solution is to just document what dependencies of 
tika-parsers can be excluded at the maven level to disable certain parsers 
+ shrink the resulting dependency tree. However, that requires manual 
updates, manual checking, and like our examples on the website risk 
getting out of date without automated checking.


Discussion then turned to our move to get all the examples for the website 
into svn, with unit tests, and having the website pull those from svn on 
the fly to always get the latest tested version.



That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full 
bundle as now, but also have ones for pdf, microsoft office, images 
etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if 
they only wanted a handful of parsers, or the full one as now.


The smart bit - we have unit tests for these smaller bundles. These unit 
tests ensure that the desired parsers still work on their smaller bundle. 
These unit tests also ensure that unwanted parsers don't work, thus 
flagging up if extra dependencies have snuck through.
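
As a sketch of what such a check could look like, assuming each bundle can report the set of media types it ends up supporting: the helper below reports both wanted types that are missing and unwanted types that are present. The name BundleSubsetCheck and the idea of comparing plain media type strings are assumptions for illustration, not an existing Tika test API.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the two-sided check described above.
final class BundleSubsetCheck {

    // Returns the problems found; an empty set means the bundle exposes the expected subset.
    static Set<String> problems(Set<String> supported,
                                Set<String> wanted,
                                Set<String> unwanted) {
        Set<String> problems = new TreeSet<>();
        for (String type : wanted) {
            if (!supported.contains(type)) {
                problems.add("missing: " + type);
            }
        }
        for (String type : unwanted) {
            if (supported.contains(type)) {
                // An unwanted parser working suggests an extra dependency slipped in
                problems.add("unexpected: " + type);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Illustrative values for a pdf-only bundle that accidentally also handles images
        Set<String> supported = Set.of("application/pdf", "image/png");
        Set<String> wanted = Set.of("application/pdf");
        Set<String> unwanted = Set.of("application/msword", "image/png");
        System.out.println(problems(supported, wanted, unwanted));
    }
}
```

A unit test for a given bundle would then simply assert that the returned set is empty.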


Finally, we pull out the includes/excludes information that went into the 
bundle, and display that for non-OSGi users. A non-OSGi person wanting 
tika with pdf only could then look at what the tika-pdf-bundle does and 
doesn't use, and from that know what maven level dependencies to keep and 
which to exclude
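
One way the "display that for non-OSGi users" step might be sketched, assuming the build can supply the full tika-parsers dependency list and the subset a bundle keeps as groupId:artifactId strings: everything not kept is printed as a Maven exclusion element. The class name ExclusionReport is hypothetical, and the coordinates in main are only illustrative, since which artifacts a real tika-pdf bundle keeps is not stated in this thread.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: derives Maven exclusions from what a bundle does not include.
final class ExclusionReport {

    // Each dependency is a "groupId:artifactId" string.
    static String exclusionsFor(List<String> allDependencies, Set<String> keptByBundle) {
        return allDependencies.stream()
            .filter(dep -> !keptByBundle.contains(dep))
            .map(dep -> {
                String[] parts = dep.split(":", 2);
                return "<exclusion><groupId>" + parts[0] + "</groupId>"
                     + "<artifactId>" + parts[1] + "</artifactId></exclusion>";
            })
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Illustrative input only
        List<String> all = List.of("org.apache.pdfbox:pdfbox", "org.apache.poi:poi");
        Set<String> kept = Set.of("org.apache.pdfbox:pdfbox");
        System.out.println(exclusionsFor(all, kept));
    }
}
```

Because the input would come from the tested bundle definition, the published exclusion list should not drift the way hand-maintained docs do.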



This new plan would mean having to tweak our build to support multiple 
bundles, and potentially tweaking our bundles so that you could load 
tika-pdf + tika-image and have those two play nicely together. It'd also 
need some new unit tests, and the work to figure out what to 
include/exclude for each of our handful of common cases. It should, 
however, deliver a way for OSGi and non-OSGi people to get just a subset 
if that's all they want.


Can anyone see a flaw with this plan? Anyone see a better way? Anyone want 
to help? :)


Nick