Re: DocFormats - Open source OOXML implementation

jan i Sat, 16 Aug 2014 01:11:08 -0700

On 16 August 2014 03:50, Peter Kelly <kelly...@gmail.com> wrote:

> On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pesce...@apache.org> wrote:
>
> On 15/08/2014 Peter Kelly wrote:
>
> Those of you interested in OOXML may want to have a look at my own
> implementation of (a subset of) the spec, which is part of a library
> I've just made available as open source (license is ASLv2):
> https://github.com/uxproductivity/DocFormats
>
>
> It's very interesting. I hope that in future it may become relevant to
> OpenOffice or to Apache at large.
>
> The design is based on bidirectional transformation, as a way of
> achieving non-destructive editing of foreign file formats. This permits
> incremental implementation of a given spec without risking data loss due
> to incomplete features, since unsupported features of a given file
> format are left untouched on save.
>
>
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to
> "filename.docx"? It is failing rather badly (invalid OOXML output in the
> second conversion, ZIP container clearly missing files and possible
> breaking order) in a simple test I did with a 1-page docx file.
>
>
> I'm not surprised this is the first issue to come up :$ There's a *lot* of
> knowledge I need to document for others; questions from you and others are
> the best way to motivate me to get that written ;)
>
> What's happening here is that when filename.html is produced in the first
> step, each of its elements contains an id attribute containing a numeric
> identifier that refers to a specific element in the source docx file
> (specifically, the word/document.xml file within the package). These
> numeric identifiers are generated during parsing, and correspond to the
> position of the element in document order (so 1, 2, 3, etc.). When you
> convert from HTML to .docx, it uses the id attributes to re-establish these
> relationships, so that it knows which elements in the HTML file correspond
> to which elements in the .docx file.
>
> The problem you encountered stems from the fact that this mapping is only
> valid in specific circumstances - that is, when the .docx file being
> updated is exactly the same as its original. If this is not the case, then
> the identifier assigned to a given node will be different whenever there are
> other nodes that have been inserted before it. So for example if you do
> the following:
>
> dfutil filename.docx filename.html
> # Modify filename.html
> dfutil filename.html filename.docx
> dfutil filename.html filename.docx
>
> Then the third run will fail, because in the second run the docx file will
> have been updated based on the changes in the HTML, changing the sequence
> numbers assigned to each node, so on the third run the mapping will no
> longer be valid. The conversion works on the assumption that the docx file
> is the same as the original. The way that UX Write uses the library, it
> ensures this is the case, but the library does not check for this (and yes,
> it should; more on this below).
>
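
A minimal sketch of the numbering problem described above, not taken from the DocFormats source: it assumes identifiers are simply 1-based positions in document order, and uses Python's standard ElementTree in place of the library's own parser. The helper name number_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

def number_elements(root):
    """Map each element to its 1-based position in document order."""
    return {elem: seq for seq, elem in enumerate(root.iter(), start=1)}

# Stand-in for word/document.xml; real documents use the w: namespace.
original = ET.fromstring("<body><p>one</p><p>two</p></body>")
first, second = original.findall("p")
before = number_elements(original)

# Simulate an update that inserts a new paragraph ahead of the second one.
original.insert(1, ET.Element("p"))
after = number_elements(original)

# The same node now has a different sequence number, so an HTML file that
# still refers to the old number points at the wrong element.
print(before[second], after[second])
```

Nodes located before the insertion point keep their numbers; every node after it shifts by one, which is why an HTML file generated from the old docx no longer lines up.
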
> Your case is similar, though in this case you're creating a new docx file,
> not updating an existing one. However what it actually does in this case is
> to create an empty .docx file, and then "update" that based on the HTML. In
> doing so, it assumes that the HTML does not contain any mappings (that is,
> id attributes with the prefix "bdt"). Since the filename.html you generated
> does, it tries to map these to elements in the docx file, failing badly.
>
> The only workaround for this at present is to manually edit the HTML file
> and remove all id attributes. The quickest way to do this is with the
> following command:
>
> sed -i '' -E 's/ id="word[0-9]+"//g' filename.html
>
> Then, when you run dfutil, it will see that there is no mapping for any of
> the elements in the HTML file, and thus avoid the problems in the output
> you observed.
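
For anyone without a BSD-style sed, the same workaround can be sketched in Python. The pattern is copied from the sed command above, so the "word" prefix is an assumption about what dfutil emits; the function name strip_mapping_ids is hypothetical.

```python
import re

# Same pattern as the sed command: a space, then id="word<digits>".
ID_ATTR = re.compile(r' id="word[0-9]+"')

def strip_mapping_ids(html: str) -> str:
    """Remove every mapping id attribute, leaving the rest of the markup alone."""
    return ID_ATTR.sub("", html)

sample = '<p id="word12">Hello <span id="word13">world</span></p>'
print(strip_mapping_ids(sample))
```

Unlike a sed substitution without the g flag, re.sub replaces every match, so several id attributes on one line are all removed.
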
>
> Now, onto the fix:
>
> The library needs to have some way of checking that the HTML file being
> used as part of an update operation has a mapping (id attributes) that
> match the docx file being updated (in the case of creating a new file, this
> is just an empty docx file). In the event that this is not the case, it
> could still do the update, but would act as if the entire document had been
> replaced with a completely new one.
>
> The solution I'll likely implement (and this should really be my first
> task, given the potential for problems like the above) is this:
>
In my humble opinion you should not spend time on this right now.


If you fix a bug we have a 1-1 relation (1 man used, 1 bug fixed).
If you start getting the documentation right we have a 1-n relation (1 man
used, n men help fix bugs).

Please keep in mind, we are building a community in order to move away from
"I have to do it, because I am the only one who knows how", and you are the
most important enabler of that... we need your knowledge in a file, so that
others can work.



>
> - Include a hash of the .docx file (or relevant parts of it) in the HTML
> file, e.g. as a meta element or as part of the prefix on all id attributes
> - On update, re-compute the hash of the .docx file and compare it
> against the one stored in the HTML file (if any), and if there's no match,
> treat the HTML file as a complete replacement of all content
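
The two steps above could look roughly like this. It is a sketch under stated assumptions, not the planned implementation: it hashes only word/document.xml with SHA-256, stores the digest in a meta element whose name (source-docx-sha256) is made up for illustration, and all function names are hypothetical.

```python
import hashlib
import io
import re
import zipfile

META_NAME = "source-docx-sha256"  # hypothetical meta element name

def docx_hash(docx_bytes: bytes) -> str:
    """Hash the main document part of a .docx package (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return hashlib.sha256(package.read("word/document.xml")).hexdigest()

def stored_hash(html: str):
    """Return the hash recorded in the HTML file, or None if there is none."""
    pattern = r'<meta name="%s" content="([0-9a-f]{64})"' % re.escape(META_NAME)
    match = re.search(pattern, html)
    return match.group(1) if match else None

def mapping_is_valid(html: str, docx_bytes: bytes) -> bool:
    """True only if the HTML was generated from exactly this docx content."""
    return stored_hash(html) == docx_hash(docx_bytes)
```

When mapping_is_valid returns False, the caller would ignore the id attributes and treat the HTML as a complete replacement of the document content.
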
>
>
>
> What is the best channel to report issues?
>
For now it's surely to mail Peter. But Peter, can you please make a bug
directory, and put emails as plain text in there, so we have a reference.
Ideally the mails should be numbered, and the fixes should carry the same
number in the commit text.

When the project (hopefully) enters incubator, we will automatically have
access to a bug tracking system (jira), and with that hopefully only being
some months away I would not recommend setting one up now.

@andrea thanks a lot for your test. A little bit of background: peter
separated the closed source project and the new open source project, and it
seems it was done a bit hastily (we are all highly motivated to get this
going).

@andrea, patches will be most welcome. Due to a recommendation from jake f.
(infra) we have made the repo read-only, but I or peter will make sure your
patches go into the code base very quickly.

rgds
jan I

--
> Dr. Peter M. Kelly
> Founder, UX Productivity
> pe...@uxproductivity.com
> http://www.uxproductivity.com/
> http://www.kellypmk.net/
>
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>
