Re: Subsets of tika parsers redux
Hi Nick,

I think I've actually learned a new urban dictionary word mentioned in this thread, 'faff' :-).

On 16/12/14 03:34, Nick Burch wrote:
> On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>> I'm not proposing to split tika-parsers in a way that would affect the users. tika-parsers would still be there, except that it would strongly depend on tika-pdf and perhaps, when it is being built, it can have its dependencies like tika-pdf shaded in/merged in to ensure complete backward compatibility as far as the user expectations of tika-parsers are concerned.
>
> We still have the additional faff of multiple core modules, which someone warned about in an earlier thread, and additional work for developers, and we did try pulling out the pdf parser which didn't work, and I'm finding having the Vorbis parsers in a different module + repo to be a faff.

I was thinking of introducing a very small number of extra modules (at the 'expense' of tika-parsers), those covering the mainstream parsers, the ones you mentioned earlier: pdf, plus a few others. tika-parsers would still be effectively the same after build time, with no side effects for the tika-parsers users. Perhaps it is difficult to realize practically...

> My plan doesn't involve any of those problems in phase 1 - core + parsers don't change at all, so if it doesn't work we haven't got to work hard to undo it, and people not interested aren't affected.
>
> Alternately, if you head back to some of the earlier threads on this, and can come up with reasons why the objections raised there can be overruled, we could hack up tika parsers. (I'm trying to come up with a plan that respects previously raised issues)

Sounds good, thanks. I might experiment a bit later on and create a patch for review, but I'll take a pause for now.

Cheers, Sergey

> Nick
Re: Subsets of tika parsers redux
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>> As a first step, I thought we'd still keep the same tika-parser jar, the only difference would be what dependencies ended up in the bundle. If the tika-bundle-pdf has no POI jars included in it, then the Microsoft Office related parsers shouldn't register themselves. It would mean that the pdf bundle would have the image, microsoft etc parser code in it, but the parsers wouldn't be registered as their dependencies wouldn't be there.
>>
>> Not sure if this can/will work, but it would mean we can do cut-down bundles + cut-down-maven-docs, without needing to change anything else. If it proves popular, we can then re-visit the giant tika parsers question, but if not it shouldn't change anything. Well, that's the theory... :)
>
> Sorry if I haven't completely understood the idea. I think there's definitely something nice being suggested above, and it sounds to me as if the following can be one possible realization of it, as a first step for example:
>
> - add a tika-pdf module, this will be a bundle, so it will work as a jar and as an OSGi bundle; the code for tika-pdf will be extracted (and removed) from tika-parsers

Not quite - I foresee this being OSGi only for now. Tika Parsers project would be unchanged, OSGi users could have tika (all) as now, or just tika-pdf.

> - tika-parsers will get updated to depend on tika-pdf - hence users working with tika-parsers won't be affected

No, that's a possible phase 2 if it goes well. No change for non-OSGi stuff. Non-OSGi users can see the OSGi build to work out what to include and exclude if they want.

(This means that we have a unit tested way to see what you do/don't want, without affecting things for the simple Tika users who get confused already with tika-core + tika-parsers)

> - those users who want to work with PDF only would add tika-core + tika-pdf dependencies only

OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf, or tika + tika-parsers-pdf + tika-parsers-mp3 if they want.

OSGi is nicely contained, and fairly easy to unit test, so let's use that to test out the idea! That also solves the CXF need. Once that works, and once we have a tested way that everyone can see + understand, then someone can try to make the case for phase II where we push it to the maven pom / project level!

Nick
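A minimal sketch of the "parsers don't register when their dependencies are missing" idea discussed above, in plain Java with no Tika APIs. The class name `AvailableParsers`, its method names, and the parser labels are illustrative assumptions and not how Tika itself registers parsers; the PDFBox and POI class names are used only as classpath markers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decides which parsers are usable from what is on the classpath. */
final class AvailableParsers {

    /** Returns true if the named class can be loaded (without initialising it). */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, AvailableParsers.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The dependency jar is absent or unusable, so the parser stays unregistered.
            return false;
        }
    }

    /** Keeps only the parser labels whose marker dependency class is loadable. */
    static List<String> usable(Map<String, String> markerClassByParser) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : markerClassByParser.entrySet()) {
            if (isPresent(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed mapping from a parser label to one class of the library it needs.
        Map<String, String> markers = new LinkedHashMap<>();
        markers.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        markers.put("office", "org.apache.poi.hssf.usermodel.HSSFWorkbook");
        // The result depends entirely on which jars are on the classpath at run time.
        System.out.println("Usable parsers: " + usable(markers));
    }
}
```

With this approach a cut-down bundle would ship the same parser code as the full one, and the set of working parsers would follow from the dependencies that were included.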
Re: Subsets of tika parsers redux
Hi Nick,

On 15/12/14 14:02, Nick Burch wrote:
> On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>>> As a first step, I thought we'd still keep the same tika-parser jar, the only difference would be what dependencies ended up in the bundle. If the tika-bundle-pdf has no POI jars included in it, then the Microsoft Office related parsers shouldn't register themselves. It would mean that the pdf bundle would have the image, microsoft etc parser code in it, but the parsers wouldn't be registered as their dependencies wouldn't be there.
>>>
>>> Not sure if this can/will work, but it would mean we can do cut-down bundles + cut-down-maven-docs, without needing to change anything else. If it proves popular, we can then re-visit the giant tika parsers question, but if not it shouldn't change anything. Well, that's the theory... :)
>>
>> Sorry if I haven't completely understood the idea. I think there's definitely something nice being suggested above, and it sounds to me as if the following can be one possible realization of it, as a first step for example:
>>
>> - add a tika-pdf module, this will be a bundle, so it will work as a jar and as an OSGi bundle; the code for tika-pdf will be extracted (and removed) from tika-parsers
>
> Not quite - I foresee this being OSGi only for now. Tika Parsers project would be unchanged, OSGi users could have tika (all) as now, or just tika-pdf.
>
>> - tika-parsers will get updated to depend on tika-pdf - hence users working with tika-parsers won't be affected
>
> No, that's a possible phase 2 if it goes well. No change for non-OSGi stuff. Non-OSGi users can see the OSGi build to work out what to include and exclude if they want.
>
> (This means that we have a unit tested way to see what you do/don't want, without affecting things for the simple Tika users who get confused already with tika-core + tika-parsers)
>
>> - those users who want to work with PDF only would add tika-core + tika-pdf dependencies only
>
> OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf, or tika + tika-parsers-pdf + tika-parsers-mp3 if they want.
>
> OSGi is nicely contained, and fairly easy to unit test, so let's use that to test out the idea! That also solves the CXF need. Once that works, and once we have a tested way that everyone can see + understand, then someone can try to make the case for phase II where we push it to the maven pom / project level!

The need of CXF (Tika) users (or of some other users with possibly similar requirements) is not about shipping OSGi-only Tika modules but about having an easy option of not having to include all of tika-parsers. Some CXF users would work with OSGi, some not. Sorry if I did not clarify it.

As I said, a module marked as a bundle, as opposed to a default 'jar', is just a plain jar with a few extra META-INF instructions. Given that, I don't understand why you are opposed to having tika-parsers minimized as I suggested. What exactly is your concern? Shipping something like tika-pdf but still keeping the PDF parsing code inside tika-parsers is a duplication, right?

Thanks, Sergey

> Nick
Re: Subsets of tika parsers redux
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
> I'm not proposing to split tika-parsers in a way that would affect the users. tika-parsers would still be there, except that it would strongly depend on tika-pdf and perhaps, when it is being built, it can have its dependencies like tika-pdf shaded in/merged in to ensure complete backward compatibility as far as the user expectations of tika-parsers are concerned.

We still have the additional faff of multiple core modules, which someone warned about in an earlier thread, and additional work for developers, and we did try pulling out the pdf parser which didn't work, and I'm finding having the Vorbis parsers in a different module + repo to be a faff.

My plan doesn't involve any of those problems in phase 1 - core + parsers don't change at all, so if it doesn't work we haven't got to work hard to undo it, and people not interested aren't affected.

Alternately, if you head back to some of the earlier threads on this, and can come up with reasons why the objections raised there can be overruled, we could hack up tika parsers. (I'm trying to come up with a plan that respects previously raised issues)

Nick
Re: Subsets of tika parsers redux
Hi Nick,

It was good talking to you, and thanks for initiating this thread.

It is an interesting idea, one that can lead to introducing finer-grained bundles but also providing a mechanism for the (auto-)generation of the import metadata required by each of the parser modules. Besides, introducing several smaller bundles that would group the most popular formats is a good idea on its own IMHO.

My doubt here is how many of those bundles we'd need to create, and whether it will make it easy for users to get a task like "get a parser for format A only, or parsers for formats A and B only" done.

Are we talking about introducing a parser module for every supported format, and having tika-parsers depend on all of those modules, with every parser module becoming a bundle (a jar plus an entry in the manifest)?

Thanks, Sergey

On 23/11/14 17:12, Nick Burch wrote:
> Hi All
>
> During ApacheCon, I had a chance to chat with Sergey about the subset of Tika Parsers issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :)
>
> As is shown on our list from time to time, but more commonly elsewhere, we have some users who are confused already by the split between tika-core and tika-parsers. Anything that fragments further is going to cause more issues for that kind of user. On the other hand, there are potential users out there who want just a handful of parsers, in a simple and easy and small way, who don't know a lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those are using OSGi, but not all.
>
> One suggested solution is to just document what dependencies of tika-parsers can be excluded at the maven level to disable certain parsers + shrink the resulting dependency tree. However, that requires manual updates, manual checking, and, like our examples on the website, risks getting out of date without automated checking.
>
> Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version. That led to an idea. Not sure if it'll work yet, but...
>
> What about having multiple Tika OSGi bundles? Continue with the full bundle as now, but also have ones for pdf, microsoft office, images etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now.
>
> The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through.
>
> Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting tika with pdf only could then look at what the tika-pdf-bundle does and doesn't use, and from that know what maven level dependencies to keep and which to exclude.
>
> This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of common cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want.
>
> Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :)
>
> Nick
Re: Subsets of tika parsers redux
Hey Nick,

This sounds like a great plan to me, good job to you and Sergey. As for helping, I'll try my best, but I'm not an OSGi guru :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Nick Burch n...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Sunday, November 23, 2014 at 6:12 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Subsets of tika parsers redux

> Hi All
>
> During ApacheCon, I had a chance to chat with Sergey about the subset of Tika Parsers issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :)
>
> As is shown on our list from time to time, but more commonly elsewhere, we have some users who are confused already by the split between tika-core and tika-parsers. Anything that fragments further is going to cause more issues for that kind of user. On the other hand, there are potential users out there who want just a handful of parsers, in a simple and easy and small way, who don't know a lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those are using OSGi, but not all.
>
> One suggested solution is to just document what dependencies of tika-parsers can be excluded at the maven level to disable certain parsers + shrink the resulting dependency tree. However, that requires manual updates, manual checking, and, like our examples on the website, risks getting out of date without automated checking.
>
> Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version. That led to an idea. Not sure if it'll work yet, but...
>
> What about having multiple Tika OSGi bundles? Continue with the full bundle as now, but also have ones for pdf, microsoft office, images etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now.
>
> The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through.
>
> Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting tika with pdf only could then look at what the tika-pdf-bundle does and doesn't use, and from that know what maven level dependencies to keep and which to exclude.
>
> This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of common cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want.
>
> Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :)
>
> Nick
Subsets of tika parsers redux
Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset of Tika Parsers issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere, we have some users who are confused already by the split between tika-core and tika-parsers. Anything that fragments further is going to cause more issues for that kind of user. On the other hand, there are potential users out there who want just a handful of parsers, in a simple and easy and small way, who don't know a lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those are using OSGi, but not all.

One suggested solution is to just document what dependencies of tika-parsers can be excluded at the maven level to disable certain parsers + shrink the resulting dependency tree. However, that requires manual updates, manual checking, and, like our examples on the website, risks getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version. That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full bundle as now, but also have ones for pdf, microsoft office, images etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through.

Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting tika with pdf only could then look at what the tika-pdf-bundle does and doesn't use, and from that know what maven level dependencies to keep and which to exclude.

This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of common cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :)

Nick
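The two-sided unit test described above (wanted parsers still work, unwanted parsers don't) could be sketched roughly as follows. This is plain Java under stated assumptions: `BundleSubsetCheck` and `problems` are hypothetical names, and the set of supported media types is passed in as strings; in a real test it would presumably be collected from the parsers actually loaded in the cut-down bundle.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check for a cut-down bundle: wanted types present, unwanted types absent. */
final class BundleSubsetCheck {

    /**
     * Compares what the bundle supports with what it is meant to support.
     * Returns an empty set when the bundle matches its intended subset.
     */
    static Set<String> problems(Set<String> supported, Set<String> wanted, Set<String> unwanted) {
        Set<String> problems = new TreeSet<>();
        for (String type : wanted) {
            if (!supported.contains(type)) {
                // A desired parser stopped working in the smaller bundle.
                problems.add("missing: " + type);
            }
        }
        for (String type : unwanted) {
            if (supported.contains(type)) {
                // An extra dependency has crept in and enabled a parser we meant to leave out.
                problems.add("leaked: " + type);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed example: a pdf-only bundle that accidentally still handles Word documents.
        Set<String> supported = Set.of("application/pdf", "application/msword");
        Set<String> wanted = Set.of("application/pdf");
        Set<String> unwanted = Set.of("application/msword", "image/png");
        System.out.println(problems(supported, wanted, unwanted));
    }
}
```

A test built this way fails in both directions, which is what would let the includes/excludes list for each bundle be trusted as documentation for non-OSGi users.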