Re: Document conversion engine

2012-07-08 Thread Flavio Moringa
Hi Michael,


Nice to hear from someone so high up the ranks... makes me feel much
more important :-)

2012/7/6 Michael Meeks michael.me...@suse.com

 Hi Flavio,

 On Tue, 2012-07-03 at 11:45 +0100, Flavio Moringa wrote:
  my name is Flávio Moringa, I'm from Portugal and I'm starting my
  Masters Dissertation next September (Master in Open Source software -
  http://moss.dcti.iscte.pt ).

 Welcome :-)


Thanks


  I'm not a programmer, so what I'm interested in doing is something
  along the lines of investigating the main conversion problems, identifying
  the possible conversion flows, analysing the way the conversion flow
  is implemented in LibreOffice, and eventually trying to improve this
  flow somehow.

 So - it will be hard to improve the flow without being a programmer I'm
 afraid :-)


Well, although I'm not a programmer right now, I've had my fair share of perl,
python, c, bash, java, php... maybe I'm not so fluent in programming
right now, but I'm certainly no stranger to it, and definitely not afraid to
do it if the need arises... what I meant was that I probably won't be
able to write a conversion engine by myself... but I can definitely mess
around with code...


  From your reply I assume that testing the filters, and doing
  regression tests is something I could do, maybe identifying the main
  conversion issues in groups of documents and kind of creating a major
  conversion issues table, and prioritizing those issues. Is there
  already something like that?

 There is a useful QA role in prioritising bug reports and
 interoperability issues; we have a real problem with masses of bug
 reports many of which could be duplicates. Having said that -
 interoperability has many, many known feature / impedance mis-matches
 that are non-trivial development problems to fix.

 One thing that -would- be really useful, and that Microsoft have
 internally, is an analysis tool for Microsoft's XML document formats -
 such that we can get a good idea of which attributes are actually used
 much. ie. by analysing and comparing a large corpus of documents out
 there, we can answer questions such as:

 should we implement surface charts, or 3D doughnut charts ?

 given whatever amount of feature-development time we have - simply by
 referring to the database of crunched XML files to work out which one is
 used most.

 It'd be nice to have that for ODF as well too of course for when we
 have to make zero-sum back-compatibility decisions; but for
 interoperability crunching those MS documents would be really good.

 Is that something you could do ? a bit of perl, zip extraction, XML
 parsing, etc. ?


Yes, it's definitely something I can do... I do believe that the harder
part is getting that "large corpus of documents out there"... At least in
my experience, I've found that it's hard to get users to send us documents
they use... either due to privacy concerns or enterprise policies... But a
tool like that makes a lot of sense.


 Developers are -much- more likely to let themselves be led by
 objective statistics on real documents out there, rather than subjective
 feelings of priority - which can prove rather controversial :-)


I can certainly relate to that...



 Thanks !


For now then I'll start doing as you suggest and look in bugzilla for
documents with conversion problems to try and compile as many examples as I
can. Then maybe use the latest beta to do the conversion and see which
problems are still there. Then maybe start a perl script that can scrape
the OOXML files to find the most used tags... and start from there...
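The tag-counting step could be sketched roughly as follows. This is a minimal illustration in Python rather than perl, and it assumes only that an OOXML file is a zip package whose parts are XML; the function name `count_tags` is made up for this example and is not part of any existing LibreOffice tool.

```python
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(ooxml_file):
    """Count element tags across all XML parts of one OOXML package.

    `ooxml_file` is a path or a binary file-like object (.docx, .xlsx, .pptx).
    Tags are returned in ElementTree's "{namespace}localname" form.
    """
    counts = Counter()
    with zipfile.ZipFile(ooxml_file) as package:
        for name in package.namelist():
            # Content parts end in .xml; relationship parts end in .rels.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # skip parts that are not well-formed XML
            for element in root.iter():
                counts[element.tag] += 1
    return counts
```

Calling `count_tags("sample.docx").most_common(20)` on some document (the file name is a placeholder) would list the twenty most frequent tags in that one file; summing the counters over many files gives the corpus-wide picture.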



 Michael.

 --
 michael.me...@suse.com  , Pseudo Engineer, itinerant idiot



Thanks a lot for helping out.
Cheers

-- 
*Flávio Moringa*
Project Leader



Caixa Mágica Software
Energia Open Source
Rua Soeiro Pereira Gomes, Lote 1 - 4.º B,
Edifício Espanha, 1600-196 Lisboa - Portugal
Tel.: +351 217 921 260 Fax: +351 217 921 261
http://www.caixamagica.pt
https://twitter.com/flaviomoringa
https://www.facebook.com/flaviomoringa <https://www.facebook.com/flavio.moringa>
http://pt.linkedin.com/in/flaviomoringa
http://people.caixamagica.pt/flaviomoringa
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: Document conversion engine

2012-07-08 Thread Flavio Moringa
Hi Robinson,

2012/7/6 Robinson Tryon bishop.robin...@gmail.com

 On Fri, Jul 6, 2012 at 5:51 AM, Flavio Moringa 
 flavio.mori...@caixamagica.pt wrote:


 I know that you can convert documents through the command line, using
 LibreOffice headless mode, and that can be something that's useful for
 scripting automatic tests... although I know that sometimes the main
 problems are visual and it's difficult to automatically detect the
 problems...


 I think that we still need human eyes for the final comparison, however
 the rest of the system could be automated a bit more -- e.g. we could put
 sample docs in subdirectories named by bug# and add screenshots of the docs
 as rendered in MS-Office; add in a script to have LO iterate over the
 subdirectories and spit out screenshots of how it renders the original
 files, and a little HTML GUI so that you can tab-through 2-ups of the
 original rendering vs. LO's rendering, and you've got a decent tool for
 testing improvements/regressions.


That's a good idea, at least it would facilitate the testing, which is
always very helpful... From my initial investigation into document
conversion, the visual aspect is always the difficult one because not
all things are well translated to the XML...



 Is there any kind of repository for documents that are candidates for
 conversion testing? I mean documents which are known to have conversion
 problems, and that are used to test improvements to the filters?


 I usually just search bugzilla for conversion or formatting :-) Even
 documents attached to old bugs can be helpful, as they can serve as
 regression tests.


I've just replied to Michael Meeks that I'll do just that... try to compile
a list of available documents in bugzilla that have conversion problems,
and test them on the latest beta... And see which problems still exist...


 I would like very much to become more involved in improving the conversion
 filters, since it seems to be a major problem in LibreOffice adoption, and
 everything that can be done to help in that area would certainly boost
 LibreOffice adoption, especially in the enterprise world.


 Yes, fidelity of document rendering is definitely one of the biggest
 hurdles I've faced when encouraging people to try LO. Any improvements on
 that front will be greatly appreciated!

 --R


Thanks



Re: Document conversion engine

2012-07-06 Thread Flavio Moringa
Hi Michael,

first of all thanks for replying.. I was thinking no one would :-)

From your reply I assume that testing the filters, and doing regression
tests is something I could do, maybe identifying the main conversion issues
in groups of documents and kind of creating a major conversion issues
table, and prioritizing those issues. Is there already something like that?

I know that you can convert documents through the command line, using
LibreOffice headless mode, and that can be something that's useful for
scripting automatic tests... although I know that sometimes the main
problems are visual and it's difficult to automatically detect the
problems...

Is there any kind of repository for documents that are candidates for
conversion testing? I mean documents which are known to have conversion
problems, and that are used to test improvements to the filters?

I would like very much to become more involved in improving the conversion
filters, since it seems to be a major problem in LibreOffice adoption, and
everything that can be done to help in that area would certainly boost
LibreOffice adoption, especially in the enterprise world.

Thanks
Flávio

2012/7/5 Michael Stahl mst...@redhat.com

 hi Flavio,

 On 03/07/12 12:45, Flavio Moringa wrote:

  I chose as my masters dissertation investigation topic trying to improve
  the document conversion engine in LibreOffice (ex: converting docx to
  odt), and as such I would like to know who is working on the conversion
  engines and how I can help.

 the document conversion engines in LibreOffice are called Writer, Calc,
 Draw and Impress.  conversion from e.g. DOCX to ODT happens by importing
 the DOCX file with the DOCX import filter into Writer, and then
 exporting the document from Writer with the ODF export filter.

 there are also a few filters (such as XSLT filters, and writerperfect if
 i remember correctly) that use ODF as an intermediate format, i.e., they
 import by converting their format to ODF and then importing that into
 the LO application, and export the reverse way.

  I'm not a programmer, so what I'm interested in doing is something
  along the lines of investigating the main conversion problems, identifying the
  possible conversion flows, analysing the way the conversion flow is
  implemented in LibreOffice, and eventually trying to improve this flow
  somehow.

 it seems to me the main conversion problem is a lack of manpower to
 improve the filters.  oh, and more regression tests would be useful.






Re: Document conversion engine

2012-07-06 Thread Robinson Tryon
On Fri, Jul 6, 2012 at 5:51 AM, Flavio Moringa 
flavio.mori...@caixamagica.pt wrote:


 I know that you can convert documents through the command line, using
 LibreOffice headless mode, and that can be something that's useful for
 scripting automatic tests... although I know that sometimes the main
 problems are visual and it's difficult to automatically detect the
 problems...


I think that we still need human eyes for the final comparison, however the
rest of the system could be automated a bit more -- e.g. we could put
sample docs in subdirectories named by bug# and add screenshots of the docs
as rendered in MS-Office; add in a script to have LO iterate over the
subdirectories and spit out screenshots of how it renders the original
files, and a little HTML GUI so that you can tab-through 2-ups of the
original rendering vs. LO's rendering, and you've got a decent tool for
testing improvements/regressions.
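The "little HTML GUI" part of this idea might look something like the sketch below. It assumes a layout that is not specified in the thread: one subdirectory per bug number, each holding a reference screenshot and a LibreOffice screenshot under the hypothetical file names `msoffice.png` and `libreoffice.png`. Producing the screenshots themselves is left out.

```python
import html
from pathlib import Path

# Hypothetical file names expected inside each bug directory.
REFERENCE = "msoffice.png"
RENDERED = "libreoffice.png"


def build_comparison_page(root):
    """Return an HTML page with one side-by-side pair per bug directory."""
    sections = []
    for bug_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        if not ((bug_dir / REFERENCE).exists() and (bug_dir / RENDERED).exists()):
            continue  # skip bugs that lack one of the two screenshots
        bug = html.escape(bug_dir.name)
        sections.append(
            f"<h2>{bug}</h2>\n"
            '<div style="display:flex">\n'
            f'<img src="{bug}/{REFERENCE}" alt="MS Office" width="50%">\n'
            f'<img src="{bug}/{RENDERED}" alt="LibreOffice" width="50%">\n'
            "</div>"
        )
    header = "<!DOCTYPE html>\n<title>Rendering comparison</title>\n"
    return header + "\n".join(sections) + "\n"
```

Writing the returned string to an `index.html` next to the bug directories gives a page a reviewer can scroll through; the image paths are relative, so the page only works from that location.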

 Is there any kind of repository for documents that are candidates for
 conversion testing? I mean documents which are known to have conversion
 problems, and that are used to test improvements to the filters?


I usually just search bugzilla for conversion or formatting :-) Even
documents attached to old bugs can be helpful, as they can serve as
regression tests.

 I would like very much to become more involved in improving the conversion
 filters, since it seems to be a major problem in LibreOffice adoption, and
 everything that can be done to help in that area would certainly boost
 LibreOffice adoption, especially in the enterprise world.


Yes, fidelity of document rendering is definitely one of the biggest
hurdles I've faced when encouraging people to try LO. Any improvements on
that front will be greatly appreciated!

--R


Re: Document conversion engine

2012-07-06 Thread Michael Meeks
Hi Flavio,

On Tue, 2012-07-03 at 11:45 +0100, Flavio Moringa wrote:
 my name is Flávio Moringa, I'm from Portugal and I'm starting my
 Masters Dissertation next September (Master in Open Source software -
 http://moss.dcti.iscte.pt ).

Welcome :-)

 I'm not a programmer, so what I'm interested in doing is something
 along the lines of investigating the main conversion problems, identifying
 the possible conversion flows, analysing the way the conversion flow
 is implemented in LibreOffice, and eventually trying to improve this
 flow somehow.

So - it will be hard to improve the flow without being a programmer I'm
afraid :-)

 From your reply I assume that testing the filters, and doing
 regression tests is something I could do, maybe identifying the main
 conversion issues in groups of documents and kind of creating a major
 conversion issues table, and prioritizing those issues. Is there
 already something like that?

There is a useful QA role in prioritising bug reports and
interoperability issues; we have a real problem with masses of bug
reports many of which could be duplicates. Having said that -
interoperability has many, many known feature / impedance mis-matches
that are non-trivial development problems to fix.

One thing that -would- be really useful, and that Microsoft have
internally, is an analysis tool for Microsoft's XML document formats -
such that we can get a good idea of which attributes are actually used
much. ie. by analysing and comparing a large corpus of documents out
there, we can answer questions such as:

should we implement surface charts, or 3D doughnut charts ?

given whatever amount of feature-development time we have - simply by
referring to the database of crunched XML files to work out which one is
used most.
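A question like this could be answered by counting, for each feature, how many documents in the corpus use it at least once. The sketch below is one way to do that in Python; it assumes that the chart types appear as the elements `surfaceChart`, `surface3DChart` and `doughnutChart` in the DrawingML chart namespace, which should be checked against the OOXML schema before relying on it. The function names are invented for this example.

```python
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET

# Assumed namespace of chart parts in OOXML packages (verify against the spec).
CHART_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart"
# Assumed element names for the chart types being compared.
FEATURES = ("surfaceChart", "surface3DChart", "doughnutChart")


def features_in(ooxml_file):
    """Return the set of FEATURES that occur anywhere in one package."""
    found = set()
    with zipfile.ZipFile(ooxml_file) as package:
        for name in package.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # skip parts that are not well-formed XML
            for feature in FEATURES:
                if next(root.iter(f"{{{CHART_NS}}}{feature}"), None) is not None:
                    found.add(feature)
    return found


def document_frequency(ooxml_files):
    """Count, per feature, how many documents use it at least once."""
    frequency = Counter()
    for ooxml_file in ooxml_files:
        frequency.update(features_in(ooxml_file))
    return frequency
```

Counting documents rather than raw occurrences keeps one chart-heavy spreadsheet from dominating the result, although either measure could be argued for.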

It'd be nice to have that for ODF as well too of course for when we
have to make zero-sum back-compatibility decisions; but for
interoperability crunching those MS documents would be really good.

Is that something you could do ? a bit of perl, zip extraction, XML
parsing, etc. ?

Developers are -much- more likely to let themselves be led by
objective statistics on real documents out there, rather than subjective
feelings of priority - which can prove rather controversial :-)

Thanks !

Michael.

-- 
michael.me...@suse.com  , Pseudo Engineer, itinerant idiot



Re: Document conversion engine

2012-07-05 Thread Michael Stahl
hi Flavio,

On 03/07/12 12:45, Flavio Moringa wrote:

 I chose as my masters dissertation investigation topic trying to improve
 the document conversion engine in LibreOffice (ex: converting docx to
 odt), and as such I would like to know who is working on the conversion
 engines and how I can help.

the document conversion engines in LibreOffice are called Writer, Calc,
Draw and Impress.  conversion from e.g. DOCX to ODT happens by importing
the DOCX file with the DOCX import filter into Writer, and then
exporting the document from Writer with the ODF export filter.
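From the outside, this import-then-export round trip can be driven through the headless command line. The sketch below wraps it in Python; it assumes that a `soffice` binary is on the PATH and that the installed version supports the `--headless`, `--convert-to` and `--outdir` options. The helper names are made up for illustration.

```python
import subprocess
from pathlib import Path


def conversion_command(source, outdir, target="odt", soffice="soffice"):
    """Build the command line for one headless conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, outdir, target="odt", soffice="soffice"):
    """Run the conversion and return the path where the output is expected.

    The source is loaded through its import filter and saved through the
    export filter that matches `target`.
    """
    subprocess.run(conversion_command(source, outdir, target, soffice), check=True)
    return Path(outdir) / (Path(source).stem + "." + target)
```

A regression test could call `convert()` on each sample document and then compare the result with a stored, known-good output; what "compare" means is the hard part and is not covered here.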

there are also a few filters (such as XSLT filters, and writerperfect if
i remember correctly) that use ODF as an intermediate format, i.e., they
import by converting their format to ODF and then importing that into
the LO application, and export the reverse way.

 I'm not a programmer, so what I'm interested in doing is something
 along the lines of investigating the main conversion problems, identifying the
 possible conversion flows, analysing the way the conversion flow is
 implemented in LibreOffice, and eventually trying to improve this flow
 somehow.

it seems to me the main conversion problem is a lack of manpower to
improve the filters.  oh, and more regression tests would be useful.


Document conversion engine

2012-07-03 Thread Flavio Moringa
Hello,

my name is Flávio Moringa, I'm from Portugal and I'm starting my Masters
Dissertation next September (Master in Open Source software -
http://moss.dcti.iscte.pt ).

I'm the project leader of Caixa Mágica, the main Linux distribution in
Portugal (http://www.caixamagica.pt), with deployments reaching almost a
million machines, mainly in education.

I chose as my masters dissertation investigation topic trying to improve
the document conversion engine in LibreOffice (ex: converting docx to odt),
and as such I would like to know who is working on the conversion engines
and how I can help.

I'm not a programmer, so what I'm interested in doing is something along the
lines of investigating the main conversion problems, identifying the
possible conversion flows, analysing the way the conversion flow is
implemented in LibreOffice, and eventually trying to improve this flow
somehow.

For now I'd just like to get to know the people involved, the development
plans, and any information you find relevant.

Hope to hear from you. You can contact me directly if you wish at:
flavio.moringaATcaixamagica.pt

Yours truly,
