Re: Document conversion engine
Hi Michael,

nice to hear from someone so high up the ranks as you... makes me feel much more important :-)

2012/7/6 Michael Meeks michael.me...@suse.com

> Hi Flavio,
>
> On Tue, 2012-07-03 at 11:45 +0100, Flavio Moringa wrote:
>> my name is Flávio Moringa, I'm from Portugal and I'm starting my master's dissertation next September (Master in Open Source Software - http://moss.dcti.iscte.pt).
>
> Welcome :-)

Thanks

>> I'm not a programmer, so what I'm interested in doing is something along the lines of investigating the main conversion problems, identifying the possible conversion flows, analysing the way the conversion flow is implemented in LibreOffice, and eventually trying to improve this flow somehow.
>
> So - it will be hard to improve the flow without being a programmer I'm afraid :-)

Well, although I'm not a programmer right now, I've had my fair share of perl, python, c, bash, java, php... Maybe I'm not so fluent in programming right now, but I'm certainly no stranger to it, and definitely not afraid to do it if the need arises. What I meant was that I probably won't be able to build a conversion engine by myself, but I can definitely mess around with code.

>> From your reply I assume that testing the filters and doing regression tests is something I could do, maybe identifying the main conversion issues in groups of documents, creating a table of the major conversion issues, and prioritizing those issues. Is there already something like that?
>
> There is a useful QA role in prioritising bug reports and interoperability issues; we have a real problem with masses of bug reports many of which could be duplicates. Having said that - interoperability has many, many known feature / impedance mis-matches that are non-trivial development problems to fix.
>
> One thing that -would- be really useful, and that Microsoft have internally, is an analysis tool for Microsoft's XML document formats - such that we can get a good idea of which attributes are actually used much. i.e. by analysing and comparing a large corpus of documents out there, we can answer questions such as: should we implement surface charts, or 3D doughnut charts ? given whatever amount of feature-development time we have - simply by referring to the database of crunched XML files to work out which one is used most.
>
> It'd be nice to have that for ODF as well of course, for when we have to make zero-sum back-compatibility decisions; but for interoperability crunching those MS documents would be really good.
>
> Is that something you could do ? a bit of perl, zip extraction, XML parsing, etc. ?

Yes, it's definitely something I can do... I do believe that the harder part is getting that large corpus of documents. At least in my experience, it's hard to get users to send us the documents they use, either due to privacy concerns or enterprise policies. But a tool like that makes a lot of sense.

> Developers are -much- more likely to let themselves be led by objective statistics on real documents out there, rather than subjective feelings of priority - which can prove rather controversial :-)

I can certainly relate to that...

> Thanks !

For now then I'll start doing as you suggest and look in bugzilla for documents with conversion problems, to try and compile as many examples as I can. Then maybe use the latest beta to do the conversion and see which problems are still there. Then maybe start a perl script that can scrape the OOXML files to find the most used tags... and go from there...

> Michael.
>
> --
> michael.me...@suse.com , Pseudo Engineer, itinerant idiot

Thanks a lot for helping out.

Cheers
--
*Flávio Moringa*
Project Leader
Caixa Mágica Software
Energia Open Source
Rua Soeiro Pereira Gomes, Lote 1 - 4.º B, Edifício Espanha, 1600-196 Lisboa - Portugal
Tel.: +351 217 921 260
Fax: +351 217 921 261
http://www.caixamagica.pt
https://twitter.com/flaviomoringa
https://www.facebook.com/flaviomoringa
https://www.facebook.com/flavio.moringa
http://pt.linkedin.com/in/flaviomoringa
http://people.caixamagica.pt/flaviomoringa
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
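The tag-counting idea discussed in this message (zip extraction plus XML parsing over a corpus of OOXML files) could be sketched roughly as follows. This is only an illustration in Python rather than the Perl mentioned in the thread; the function names `count_tags` and `count_corpus` and the choice of file suffixes are assumptions, not anything that exists in LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Assumption: the corpus consists of these three OOXML package types.
OOXML_SUFFIXES = {".docx", ".xlsx", ".pptx"}


def count_tags(package_path):
    """Count element tags in every XML part of one OOXML (zip) package."""
    counts = Counter()
    with zipfile.ZipFile(package_path) as package:
        for name in package.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip images, fonts and other binary parts
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # skip parts that are not well-formed XML
            for element in root.iter():
                # Tags look like "{namespace-uri}localname".
                counts[element.tag] += 1
    return counts


def count_corpus(directory):
    """Sum tag counts over all OOXML files found below a directory."""
    total = Counter()
    for path in sorted(Path(directory).rglob("*")):
        if path.suffix.lower() not in OOXML_SUFFIXES:
            continue
        try:
            total.update(count_tags(path))
        except zipfile.BadZipFile:
            continue  # e.g. a damaged or mislabelled attachment
    return total
```

Calling `count_corpus("corpus/").most_common(20)` on a directory of downloaded bug attachments (the directory name is hypothetical) would list the twenty most frequent elements. Counting attributes as well as elements would be a small extension of the inner loop.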
Re: Document conversion engine
Hi Robinson,

2012/7/6 Robinson Tryon bishop.robin...@gmail.com

> On Fri, Jul 6, 2012 at 5:51 AM, Flavio Moringa flavio.mori...@caixamagica.pt wrote:
>> I know that you can convert documents through the command line, using LibreOffice headless mode, and that can be useful for scripting automatic tests... although I know that sometimes the main problems are visual and it's difficult to detect them automatically...
>
> I think that we still need human eyes for the final comparison, however the rest of the system could be automated a bit more -- e.g. we could put sample docs in subdirectories named by bug# and add screenshots of the docs as rendered in MS-Office; add in a script to have LO iterate over the subdirectories and spit out screenshots of how it renders the original files, and a little HTML GUI so that you can tab-through 2-ups of the original rendering vs. LO's rendering, and you've got a decent tool for testing improvements/regressions.

That's a good idea; at least it would facilitate the testing, which is always very helpful... From my initial investigation into document conversion, the visual aspect is always the difficult one, because not all things are well translated to the XML...

>> Is there any kind of repository for documents that are candidates for conversion testing? I mean documents which are known to have conversion problems, and that are used to test improvements to the filters?
>
> I usually just search bugzilla for "conversion" or "formatting" :-) Even documents attached to old bugs can be helpful, as they can serve as regression tests.

I've just replied to Michael Meeks that I'll do just that: try to compile a list of available documents in bugzilla that have conversion problems, test them on the latest beta, and see which problems still exist...

>> I would like very much to become more involved in improving the conversion filters, since it seems to be a major problem in LibreOffice adoption, and everything that can be done to help in that area would certainly boost LibreOffice adoption, especially in the enterprise world.
>
> Yes, fidelity of document rendering is definitely one of the biggest hurdles I've faced when encouraging people to try LO. Any improvements on that front will be greatly appreciated!
>
> --R

Thanks
--
*Flávio Moringa*
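The "little HTML GUI" for side-by-side comparison suggested above could start from something like the sketch below. It assumes a purely hypothetical layout in which every bug subdirectory already contains two screenshots named `msoffice.png` and `libreoffice.png`; producing those screenshots is outside the scope of this snippet.

```python
import html
from pathlib import Path

# Hypothetical file names; the thread does not specify any naming scheme.
REFERENCE_NAME = "msoffice.png"
CANDIDATE_NAME = "libreoffice.png"


def build_comparison_page(root):
    """Return an HTML page with one side-by-side row per bug subdirectory."""
    rows = []
    for bug_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # Only list bugs for which both renderings are available.
        if not (bug_dir / REFERENCE_NAME).exists():
            continue
        if not (bug_dir / CANDIDATE_NAME).exists():
            continue
        bug = html.escape(bug_dir.name)
        rows.append(
            f"<h2>Bug {bug}</h2>\n"
            "<table><tr>"
            f'<td><img src="{bug}/{REFERENCE_NAME}" width="480" alt="MS Office"></td>'
            f'<td><img src="{bug}/{CANDIDATE_NAME}" width="480" alt="LibreOffice"></td>'
            "</tr></table>"
        )
    body = "\n".join(rows)
    return f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>\n"
```

Writing the returned string to `index.html` in the root directory gives a page that a human reviewer can scroll through, which matches the point made above that the final comparison still needs human eyes.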
Re: Document conversion engine
Hi Michael,

first of all, thanks for replying... I was thinking no one would :-)

From your reply I assume that testing the filters and doing regression tests is something I could do, maybe identifying the main conversion issues in groups of documents, creating a table of the major conversion issues, and prioritizing those issues. Is there already something like that?

I know that you can convert documents through the command line, using LibreOffice headless mode, and that can be useful for scripting automatic tests... although I know that sometimes the main problems are visual and it's difficult to detect them automatically...

Is there any kind of repository for documents that are candidates for conversion testing? I mean documents which are known to have conversion problems, and that are used to test improvements to the filters?

I would like very much to become more involved in improving the conversion filters, since it seems to be a major problem in LibreOffice adoption, and everything that can be done to help in that area would certainly boost LibreOffice adoption, especially in the enterprise world.

Thanks
Flávio

2012/7/5 Michael Stahl mst...@redhat.com

> hi Flavio,
>
> On 03/07/12 12:45, Flavio Moringa wrote:
>> As my master's dissertation topic I chose trying to improve the document conversion engine in LibreOffice (e.g. converting docx to odt), and as such I would like to know who is working on the conversion engines and how I can help.
>
> the document conversion engines in LibreOffice are called Writer, Calc, Draw and Impress.
>
> conversion from e.g. DOCX to ODT happens by importing the DOCX file with the DOCX import filter into Writer, and then exporting the document from Writer with the ODF export filter.
>
> there are also a few filters (such as XSLT filters, and writerperfect if i remember correctly) that use ODF as an intermediate format, i.e., they import by converting their format to ODF and then importing that into the LO application, and export the reverse way.
>
>> I'm not a programmer, so what I'm interested in doing is something along the lines of investigating the main conversion problems, identifying the possible conversion flows, analysing the way the conversion flow is implemented in LibreOffice, and eventually trying to improve this flow somehow.
>
> it seems to me the main conversion problem is a lack of manpower to improve the filters.
>
> oh, and more regression tests would be useful.
Re: Document conversion engine
On Fri, Jul 6, 2012 at 5:51 AM, Flavio Moringa flavio.mori...@caixamagica.pt wrote:
> I know that you can convert documents through the command line, using LibreOffice headless mode, and that can be useful for scripting automatic tests... although I know that sometimes the main problems are visual and it's difficult to detect them automatically...

I think that we still need human eyes for the final comparison, however the rest of the system could be automated a bit more -- e.g. we could put sample docs in subdirectories named by bug# and add screenshots of the docs as rendered in MS-Office; add in a script to have LO iterate over the subdirectories and spit out screenshots of how it renders the original files, and a little HTML GUI so that you can tab-through 2-ups of the original rendering vs. LO's rendering, and you've got a decent tool for testing improvements/regressions.

> Is there any kind of repository for documents that are candidates for conversion testing? I mean documents which are known to have conversion problems, and that are used to test improvements to the filters?

I usually just search bugzilla for "conversion" or "formatting" :-) Even documents attached to old bugs can be helpful, as they can serve as regression tests.

> I would like very much to become more involved in improving the conversion filters, since it seems to be a major problem in LibreOffice adoption, and everything that can be done to help in that area would certainly boost LibreOffice adoption, especially in the enterprise world.

Yes, fidelity of document rendering is definitely one of the biggest hurdles I've faced when encouraging people to try LO. Any improvements on that front will be greatly appreciated!

--R
Re: Document conversion engine
Hi Flavio,

On Tue, 2012-07-03 at 11:45 +0100, Flavio Moringa wrote:
> my name is Flávio Moringa, I'm from Portugal and I'm starting my master's dissertation next September (Master in Open Source Software - http://moss.dcti.iscte.pt).

Welcome :-)

> I'm not a programmer, so what I'm interested in doing is something along the lines of investigating the main conversion problems, identifying the possible conversion flows, analysing the way the conversion flow is implemented in LibreOffice, and eventually trying to improve this flow somehow.

So - it will be hard to improve the flow without being a programmer I'm afraid :-)

> From your reply I assume that testing the filters and doing regression tests is something I could do, maybe identifying the main conversion issues in groups of documents, creating a table of the major conversion issues, and prioritizing those issues. Is there already something like that?

There is a useful QA role in prioritising bug reports and interoperability issues; we have a real problem with masses of bug reports many of which could be duplicates. Having said that - interoperability has many, many known feature / impedance mis-matches that are non-trivial development problems to fix.

One thing that -would- be really useful, and that Microsoft have internally, is an analysis tool for Microsoft's XML document formats - such that we can get a good idea of which attributes are actually used much. i.e. by analysing and comparing a large corpus of documents out there, we can answer questions such as: should we implement surface charts, or 3D doughnut charts ? given whatever amount of feature-development time we have - simply by referring to the database of crunched XML files to work out which one is used most.

It'd be nice to have that for ODF as well of course, for when we have to make zero-sum back-compatibility decisions; but for interoperability crunching those MS documents would be really good.

Is that something you could do ? a bit of perl, zip extraction, XML parsing, etc. ?

Developers are -much- more likely to let themselves be led by objective statistics on real documents out there, rather than subjective feelings of priority - which can prove rather controversial :-)

Thanks !

Michael.

--
michael.me...@suse.com , Pseudo Engineer, itinerant idiot
Re: Document conversion engine
hi Flavio,

On 03/07/12 12:45, Flavio Moringa wrote:
> As my master's dissertation topic I chose trying to improve the document conversion engine in LibreOffice (e.g. converting docx to odt), and as such I would like to know who is working on the conversion engines and how I can help.

the document conversion engines in LibreOffice are called Writer, Calc, Draw and Impress.

conversion from e.g. DOCX to ODT happens by importing the DOCX file with the DOCX import filter into Writer, and then exporting the document from Writer with the ODF export filter.

there are also a few filters (such as XSLT filters, and writerperfect if i remember correctly) that use ODF as an intermediate format, i.e., they import by converting their format to ODF and then importing that into the LO application, and export the reverse way.

> I'm not a programmer, so what I'm interested in doing is something along the lines of investigating the main conversion problems, identifying the possible conversion flows, analysing the way the conversion flow is implemented in LibreOffice, and eventually trying to improve this flow somehow.

it seems to me the main conversion problem is a lack of manpower to improve the filters.

oh, and more regression tests would be useful.
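For scripted regression tests, the import-then-export round trip described above can be driven from the command line in headless mode. The sketch below wraps LibreOffice's `--headless`, `--convert-to` and `--outdir` options in Python; it assumes the launcher is reachable on the PATH as `soffice`, and the helper names and the default output directory are made up for illustration.

```python
import subprocess
from pathlib import Path


def conversion_command(source, target_format="odt", outdir="converted",
                       soffice="soffice"):
    """Build the headless command line that converts one document."""
    return [
        soffice,            # assumption: launcher name on this system
        "--headless",       # run without opening a window
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, target_format="odt", outdir="converted"):
    """Run the conversion and return the path where the result is expected."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(conversion_command(source, target_format, outdir),
                   check=True)
    return Path(outdir) / f"{Path(source).stem}.{target_format}"
```

For example, `convert("bug12345/sample.docx")` (a hypothetical file) would load the document through the DOCX import filter and write `converted/sample.odt` through the ODF export filter. Whether the result looks right still has to be checked separately.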
Document conversion engine
Hello,

my name is Flávio Moringa, I'm from Portugal and I'm starting my master's dissertation next September (Master in Open Source Software - http://moss.dcti.iscte.pt).

I'm the project leader of Caixa Mágica, the main Linux distribution in Portugal (http://www.caixamagica.pt), with deployments reaching almost a million machines, mainly in education.

As my master's dissertation topic I chose trying to improve the document conversion engine in LibreOffice (e.g. converting docx to odt), and as such I would like to know who is working on the conversion engines and how I can help.

I'm not a programmer, so what I'm interested in doing is something along the lines of investigating the main conversion problems, identifying the possible conversion flows, analysing the way the conversion flow is implemented in LibreOffice, and eventually trying to improve this flow somehow.

For now I'd just like to get to know the people involved, the development plans, and any information you find relevant.

Hope to hear from you. You can contact me directly if you wish at: flavio.moringaATcaixamagica.pt

Yours truly
--
*Flávio Moringa*