On 16 August 2014 03:50, Peter Kelly <kelly...@gmail.com> wrote:

> On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pesce...@apache.org> wrote:
>
> On 15/08/2014 Peter Kelly wrote:
>
> Those of you interested in OOXML may want to have a look at my own
> implementation of (a subset of) the spec, which is part of a library
> I've just made available as open source (license is ASLv2):
> https://github.com/uxproductivity/DocFormats
>
> It's very interesting. I hope that in future it may become relevant to
> OpenOffice or to Apache at large.
>
> The design is based on bidirectional transformation, as a way of
> achieving non-destructive editing of foreign file formats. This permits
> incremental implementation of a given spec without risking data loss due
> to incomplete features, since unsupported features of a given file
> format are left untouched on save.
>
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to
> "filename.docx"? It is failing rather badly (invalid OOXML output in the
> second conversion, ZIP container clearly missing files and possible
> breaking order) in a simple test I did with a 1-page docx file.
>
> I'm not surprised this is the first issue to come up :$ There's a *lot* of
> knowledge I need to document for others; questions from you and others are
> the best way to motivate me to get that written ;)
>
> What's happening here is that when filename.html is produced in the first
> step, each of its elements carries an id attribute containing a numeric
> identifier that refers to a specific element in the source docx file
> (specifically, the word/document.xml file within the package). These
> numeric identifiers are generated during parsing, and correspond to the
> position of the element in document order (so 1, 2, 3, etc.).
> When you convert from HTML to .docx, it uses the id attributes to
> re-establish these relationships, so that it knows which elements in the
> HTML file correspond to which elements in the .docx file.
>
> The problem you encountered stems from the fact that this mapping is only
> valid in specific circumstances - that is, when the .docx file being
> updated is exactly the same as the original. If this is not the case, then
> the identifier assigned to a given node will be different whenever other
> nodes have been inserted before it. So for example if you do the
> following:
>
> dfutil filename.docx filename.html
> # Modify filename.html
> dfutil filename.html filename.docx
> dfutil filename.html filename.docx
>
> Then the third run will fail, because in the second the docx file will
> have been updated based on the changes in the HTML, changing the sequence
> numbers assigned to each node, so on the third run the mapping will no
> longer be valid. The conversion works on the assumption that the docx file
> is the same as the original. The way that UX Write uses the library, it
> ensures this is the case, but the library does not check for this (and
> yes, it should; more on this below).
>
> Your case is similar, though in this case you're creating a new docx file,
> not updating an existing one. However what it actually does in this case is
> to create an empty .docx file, and then "update" that based on the HTML. In
> doing so, it assumes that the HTML does not contain any mappings (that is,
> id attributes with the prefix "bdt"). Since the filename.html you generated
> does, it tries to map these to elements in the docx file, failing badly.
>
> The only workaround for this at present is to manually edit the HTML file
> and remove all id attributes.
> The quickest way to do this is with the following command:
>
> sed -i '' -E 's/ id="word[0-9]+"//g' filename.html
>
> Then, when you run dfutil, it will see that there is no mapping for any of
> the elements in the HTML file, and thus avoid the problems in the output
> you observed.
>
> Now, onto the fix:
>
> The library needs to have some way of checking that the HTML file being
> used as part of an update operation has a mapping (id attributes) that
> matches the docx file being updated (in the case of creating a new file,
> this is just an empty docx file). In the event that this is not the case,
> it could still do the update, but would act as if the entire document had
> been replaced with a completely new one.
>
> The solution I'll likely implement (and this should really be my first
> task, given the potential for problems like the above) is this:

In my humble opinion you should not use time on this right now.
If you fix a bug we have a 1-1 relation (1 man used, 1 bug fixed). If you
start getting the documentation right we have a 1-n relation (1 man used,
n men help fix bugs).

Please keep in mind, we are building a community in order to move away from
"I have to do it, because I am the only one who knows how", and you are the
most important enabler of that... we need your knowledge in a file, so that
others can work.

> - Include a hash of the .docx file (or relevant parts of it) in the HTML
> file, e.g. as a meta element or as part of the prefix on all id attributes
> - On update, re-compute the hash of the .docx file and compare it
> against the one stored in the HTML file (if any), and if there's no match,
> treat the HTML file as a complete replacement of all content
>
> What is the best channel to report issues?

For now it's surely to mail peter.

But peter, can you please make a bug directory, and put the emails as plain
text in there, so we have a reference. Ideally the mails should be numbered,
and the fixes carry the same number in the commit text.

When the project (hopefully) enters the incubator, we will automatically
have access to a bug tracking system (jira), and with that hopefully only
being some months away I would not recommend setting one up now.

@andrea thanks a lot for your test. A little bit of background: peter
separated the closed source project and the new open source project, and it
seems it was done a bit hastily (we are all highly motivated to get this
going).

@andrea, patches will be most welcome. Due to a recommendation from jake f.
(infra) we have made the repo RO, but I or peter will make sure your patches
go into the code base very quickly.

rgds
jan I

--
> Dr. Peter M. Kelly
> Founder, UX Productivity
> pe...@uxproductivity.com
> http://www.uxproductivity.com/
> http://www.kellypmk.net/
>
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
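
For reference, the hash check proposed in the quoted mail (store a hash of
the .docx in the HTML, re-compute it on update, and fall back to a complete
replacement when they differ) could look roughly like the Python sketch
below. This is only an illustration and not how DocFormats does or will do
it: the meta element name, the choice of SHA-256, and all function names are
assumptions made up for the example, and the id form "wordNNN" is simply
taken from the sed command above. Reading word/document.xml out of the .docx
package is left out; the functions take its bytes directly.

```python
import hashlib
import re

# Hypothetical meta element name, invented for this sketch.
META_NAME = "docx-source-hash"


def docx_fingerprint(document_xml: bytes) -> str:
    """Hash the part of the package the ids refer to (word/document.xml)."""
    return hashlib.sha256(document_xml).hexdigest()


def embed_fingerprint(html: str, fingerprint: str) -> str:
    """Record the fingerprint as a meta element (assumes a literal <head> tag)."""
    meta = f'<meta name="{META_NAME}" content="{fingerprint}">'
    return html.replace("<head>", "<head>" + meta, 1)


def stored_fingerprint(html: str):
    """Return the fingerprint stored in the HTML, or None if there is none."""
    pattern = rf'<meta name="{re.escape(META_NAME)}" content="([0-9a-f]+)">'
    match = re.search(pattern, html)
    return match.group(1) if match else None


def strip_mapping_ids(html: str) -> str:
    """Drop id attributes of the form id="word123", like the sed workaround."""
    return re.sub(r' id="word[0-9]+"', "", html)


def prepare_for_update(html: str, current_document_xml: bytes) -> str:
    """Keep the id mapping only if the HTML was generated from this exact
    document; otherwise discard it so the HTML is treated as all-new content."""
    if stored_fingerprint(html) == docx_fingerprint(current_document_xml):
        return html
    return strip_mapping_ids(html)


# Usage: the mapping survives against the original document and is dropped
# once the document has changed.
original = b"<w:document><w:p/></w:document>"
modified = b"<w:document><w:p/><w:p/></w:document>"
plain_html = '<html><head></head><body><p id="word1">Hi</p></body></html>'
html = embed_fingerprint(plain_html, docx_fingerprint(original))

print('id="word1"' in prepare_for_update(html, original))
print('id="word1"' in prepare_for_update(html, modified))
```

An HTML file with no stored hash at all takes the same path as a mismatch,
which matches the "(if any)" in the proposal: its ids are ignored rather
than mapped onto unrelated elements.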