Hi Tim, All

Sure, I agree that Tika is not really about transformation etc.; that is just not what I was suggesting, even though I started with a link to the IHtmlToPdfTransformer. Let me clarify one more time and I'll be happy to move on.

So, to put it in practical terms:
- create a tika-format-creator (or similarly named) module
- introduce a simple generic API (similar to the prototype API earlier in the thread) for creating simple format-specific docs, and document that it is going to stay experimental for a while
- this API is not about transformation; it is for Tika users to create the docs directly
- provide only two implementations of this API for a start, one for PDF and another for ODT. In time it may grow a bit to support a few more of the most used formats; there is no goal to support hundreds of formats. (This is why I don't understand the maintenance concern :-) )
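
To make the list above concrete, here is a minimal sketch of the shape such an API could take. It is only an illustration under stated assumptions: none of these types exist in Tika, the names (ContentCreatorSketch, PlainTextCreator, the "TEXT" format key) are hypothetical, and a plain-text implementation stands in for the proposed PDF and ODT ones so that the sketch runs without third-party jars. The addAttachment call from the earlier prototype is left out for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Tika.
class ContentCreatorSketch {

    /** Format-neutral creation API, modelled on the prototype earlier in the thread. */
    interface ContentCreator {
        void addTitle(String title);
        void addText(String text);
        // Column name -> cell values for that column, in insertion order.
        void addTable(String name, Map<String, List<String>> columns);
        // Returns the finished document; a PDF or ODT creator would likely return bytes.
        String complete();
    }

    /** Stand-in implementation that renders plain text instead of PDF or ODT. */
    static final class PlainTextCreator implements ContentCreator {
        private final StringBuilder out = new StringBuilder();

        @Override public void addTitle(String title) {
            out.append(title.toUpperCase(Locale.ROOT)).append("\n\n");
        }

        @Override public void addText(String text) {
            out.append(text).append("\n\n");
        }

        @Override public void addTable(String name, Map<String, List<String>> columns) {
            out.append("[").append(name).append("]\n");
            for (Map.Entry<String, List<String>> column : columns.entrySet()) {
                out.append(column.getKey()).append(": ")
                   .append(String.join(", ", column.getValue())).append("\n");
            }
            out.append("\n");
        }

        @Override public String complete() {
            return out.toString();
        }
    }

    /** Lookup by format name; the proposed PDF and ODT creators would be added here. */
    static ContentCreator get(String format) {
        if ("TEXT".equalsIgnoreCase(format)) {
            return new PlainTextCreator();
        }
        throw new IllegalArgumentException("Unsupported format: " + format);
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("Format", Arrays.asList("PDF", "ODT"));
        table.put("Status", Arrays.asList("planned", "planned"));

        ContentCreator creator = get("TEXT");
        creator.addTitle("Introduction to Tika");
        creator.addText("Tika reads many formats through one API.");
        creator.addTable("targets", table);
        System.out.print(creator.complete());
    }
}
```

The caller only ever sees the ContentCreator interface, so supporting another format would mean adding one implementation and one branch in get(), without touching user code.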

In the end, users would be able to use a Tika-specific API to read and, for some of the most used formats, create docs. Tika's appeal is in having a uniform API for reading N formats, so users don't need code that switches between N format-specific parser APIs. But users who work with Tika and also have to create some formats still have to go beyond Tika... ending up with semi-generic code after all. That was the idea I tried to convey earlier in the thread...

Thanks all, Sergey

On Wed, Oct 16, 2019 at 5:07 PM Tim Allison <talli...@apache.org> wrote:

> +1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build in Germany, and we only discovered that because of inviting Tilman. :D We have a huge amount of maintenance already...
>
> Check out the incubating Daffodil project that aims to convert files to xml, validate them and then serialize back to original format.
>
> I do see a use for transform() and if we could use xhtml as an intermediary, then...maybe, but my inclination is with Ken.
>
> On Wed, Oct 16, 2019 at 11:50 AM Ken Krugler <kkrug...@apache.org> wrote:
>
> > I can see the attraction of one API to convert XHTML to various formats.
> >
> > Though very quickly that simple API would become complex, as each target format has its own conversion options.
> >
> > And if successful, we’d pull in even more 3rd party jars to handle that conversion.
> >
> > Wonder if there’s a need for a new project called “Akit”, which focuses on XHTML -> various formats :)
> >
> > -- Ken
> >
> > > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> > >
> > > Ken, thanks for the feedback, I meant to reply to your comments,
> > >
> > > I suppose I really meant Tika offering a uniform API to create some simple structured PDF/etc files.
> > >
> > > ContentCreator creator = ContentCreator.get("PDF");
> > > creator.addTitle("Introduction to Tika");
> > > creator.addText("");
> > > creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> > > creator.addAttachment(someImage);
> > > creator.complete();
> > >
> > > It would be consistent with the Tika approach on the read side.
> > >
> > > Cheers, Sergey
> > >
> > > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler <kkrug...@apache.org> wrote:
> > >
> > >> If you’re suggesting ways to make it easier to use something like YaHPConverter with Tika, definitely yes.
> > >>
> > >> If you’re talking about integrating this functionality…my personal view is no.
> > >>
> > >> I think Tika should focus on extracting content from documents, versus format transformations.
> > >>
> > >> Tika is an attractive location for functionality like this, since it sits in the middle of a lot of data processing pipelines, but I worry about a bloated code base, with corresponding challenges in maintenance and support.
> > >>
> > >> Regards,
> > >>
> > >> -- Ken
> > >>
> > >>
> > >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> > >>>
> > >>> Hi All
> > >>>
> > >>> I've seen a Quarkus user asking how to convert to PDF, and one of my colleagues pointed to
> > >>>
> > >>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> > >>>
> > >>> Does it make sense for Tika to offer something related to the text to PDF (for a start, something on top of that transformer), and then may be even for other formats ?
> > >>>
> > >>> Sergey
> > >>
> > >> --------------------------
> > >> Ken Krugler
> > >> http://www.scaleunlimited.com
> > >> custom big data solutions & training
> > >> Hadoop, Cascading, Cassandra & Solr