extracting text from docx files

2011-08-09 Thread Anton Shterenlikht
I often receive information in *.docx format
from my MS using colleagues. Sometimes I can
ask for a pdf (or similar) instead, but not always.

Usually I unzip a docx and then search
through all *xml  files to find the
useful data. However, I can't find any
xml styles to use, so I have to convert
the relevant xml file(s) to plain text
by hand. I wonder if anybody can suggest
a better way. Perhaps there's something
in ports that can help.

Many thanks
Anton
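
The manual search described above can be sketched in Python using only
the standard library. This is a rough sketch under stated assumptions,
not a finished tool: the function name xml_parts_containing is made up
for illustration, and it assumes the XML parts are UTF-8 encoded, which
is usual but not guaranteed. Word often splits a phrase across several
runs, so a single short word is the safest thing to search for.

```python
import zipfile


def xml_parts_containing(path, needle):
    """Return names of the XML members of a .docx whose raw bytes contain needle.

    Hypothetical helper for illustration: a .docx is a zip archive, so the
    members can be read in place without unpacking them to disk.
    """
    wanted = needle.encode("utf-8")  # assumption: parts are UTF-8 encoded
    with zipfile.ZipFile(path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.endswith(".xml") and wanted in archive.read(name)
        ]
```

Calling xml_parts_containing("report.docx", "budget") (the file name is
only an example) returns the list of member names worth looking at.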


-- 
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"


Re: extracting text from docx files

2011-08-09 Thread Rod Person
On Tue, 9 Aug 2011 14:36:32 +0100
Anton Shterenlikht wrote:

> Usually I unzip a docx and then search
> through all *xml  files to find the
> useful data. However, I can't find any
> xml styles to use, so I have to convert
> the relevant xml file(s) to plain text
> by hand. I wonder if anybody can suggest
> a better way. Perhaps there's something
> in ports that can help.

You could try this for just plain text conversion
http://docx2txt.sourceforge.net/

-- 
Rod


Re: extracting text from docx files

2011-08-09 Thread Anton Shterenlikht
On Tue, Aug 09, 2011 at 09:40:26AM -0400, Rod Person wrote:
> On Tue, 9 Aug 2011 14:36:32 +0100
> Anton Shterenlikht wrote:
> 
> > Usually I unzip a docx and then search
> > through all *xml  files to find the
> > useful data. However, I can't find any
> > xml styles to use, so I have to convert
> > the relevant xml file(s) to plain text
> > by hand. I wonder if anybody can suggest
> > a better way. Perhaps there's something
> > in ports that can help.
> 
> You could try this for just plain text conversion
> http://docx2txt.sourceforge.net/

Thank you
Anton



Re: extracting text from docx files

2011-08-09 Thread Kurt Buff
On Tue, Aug 9, 2011 at 06:36, Anton Shterenlikht wrote:
> I often receive information in *.docx format
> from my MS using colleagues. Sometimes I can
> ask for a pdf (or similar) instead, but not always.
>
> Usually I unzip a docx and then search
> through all *xml  files to find the
> useful data. However, I can't find any
> xml styles to use, so I have to convert
> the relevant xml file(s) to plain text
> by hand. I wonder if anybody can suggest
> a better way. Perhaps there's something
> in ports that can help.

My installation of OpenOffice 3.3 on my Win7 machine will open a
Winword 2010 .docx file.

I'm guessing it will do the same on FreeBSD, but I don't have an
install with a GUI running at the moment.

Kurt


Re: extracting text from docx files

2011-08-09 Thread Matthias Apitz
On Tuesday, August 09, 2011 at 10:25:30AM -0700, Kurt Buff wrote:

> My installation of OpenOffice 3.3 on my Win7 machine will open a
> Winword 2010 .docx file.
> 
> I'm guessing it will do the same on FreeBSD, but I don't have an
> install with a GUI running at the moment.

It does, using OpenOffice 3.4.0 in 9-CURRENT.

matthias
-- 
Matthias Apitz
t +49-89-61308 351 - f +49-89-61308 399 - m +49-170-4527211
w http://www.unixarea.de/


Re: extracting text from docx files

2011-08-09 Thread Christian Barthel
On Tue, Aug 09, 2011 at 02:36:32PM +0100, Anton Shterenlikht wrote:
> I often receive information in *.docx format
> from my MS using colleagues. Sometimes I can
> ask for a pdf (or similar) instead, but not always.

You have a lot of nice options:
- Force them to use BSD/Linux ;)
- Explain to them why docx is shit!
- Don't read it

> 
> Usually I unzip a docx and then search
> through all *xml  files to find the
> useful data. However, I can't find any
> xml styles to use, so I have to convert
> the relevant xml file(s) to plain text
> by hand. I wonder if anybody can suggest
> a better way. Perhaps there's something
> in ports that can help.

But if you really, really need to read docx, you can try the web
application from Microsoft. A few months ago I also received a lot of
docx files and opened them with the Microsoft web app; this worked for
me to extract the information...

More information:
http://office.microsoft.com/en-us/web-apps/

The downside: you have to sign up for a Microsoft service :(

cheers

-- 
Christian Barthel 
Public-Key: http://bc.user-mode.org/bc.asc 
Mail: b...@nyx.user-mode.org
Web: http://bc.user-mode.org


Re: extracting text from docx files

2011-08-09 Thread Antonio Olivares
> But if you really, really need to read docx, you can try the web
> application from Microsoft. A few months ago, I got also a lot of docx
> and I opened it with the microsoft web app; this worked for me to extract
> the information...
>
> More information:
> http://office.microsoft.com/en-us/web-apps/
>
> The downside:  you have to sign up on a microsoft service :(
>

You can also use LibreOffice. It is in the ports system :)

Without installing anything, Google Docs also opens *.docx files, if
needed. There are other options too, but it depends on whether Anton
wants to install something or just view and extract the text.

Regards,

Antonio


Re: extracting text from docx files

2011-08-09 Thread Christian Barthel
On Tue, Aug 09, 2011 at 02:57:51PM -0500, Antonio Olivares wrote:
> > But if you really, really need to read docx, you can try the web
> > application from Microsoft. A few months ago, I got also a lot of docx
> > and I opened it with the microsoft web app; this worked for me to extract
> > the information...
> >
> > More information:
> > http://office.microsoft.com/en-us/web-apps/
> >
> > The downside: you have to sign up on a microsoft service :(
> >
> 
> Can also use libreoffice.  It is in the ports system :)

Sure. But LibreOffice is a matter of opinion. *I* would never ever
install this bloated, buggy software product @_@

But I must admit that I am very spoiled: vim + LaTeX _rocks_

> 
> Without installing anything, Google Docs also opens *.docx files, if
> needed. There are other options too, but it depends on what Anton
> wants to install* or just view* & extract?

I have a Google account but have never used Google Docs. Nice to know...

> 
> Regards,
> 
> Antonio


Re: extracting text from docx files

2011-08-09 Thread Alejandro Imass
On Tue, Aug 9, 2011 at 3:57 PM, Antonio Olivares wrote:
>> But if you really, really need to read docx, you can try the web
>> application from Microsoft. A few months ago, I got also a lot of docx
>> and I opened it with the microsoft web app; this worked for me to extract
>> the information...
>>

Just a thought here, but if docx is XML, why not just find or build some
XSLT that extracts what you need into another format?
You probably already have libxml2 and libxslt on your system, along with
the command-line utility xsltproc.
There are probably existing XSLT stylesheets that transform to RTF and
plain text.
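
The same idea can also be sketched without XSLT, using only Python's
standard library. This is a minimal sketch under stated assumptions, not
a complete converter: it assumes the body text lives in the
word/document.xml member, with paragraphs stored as w:p elements and
text as w:t elements, and the function name docx_to_text is made up for
illustration. Headers, footnotes, and comments live in other parts of
the archive and are ignored here, and all formatting is dropped.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx as a string, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t":        # a piece of literal text
                parts.append(node.text or "")
            elif node.tag == W + "tab":    # explicit tab character
                parts.append("\t")
            elif node.tag == W + "br":     # explicit line break
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)
```

Something like print(docx_to_text("report.docx")) would then dump the
text to the terminal; the file name is only an example.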

--
Alejandro Imass


Re: extracting text from docx files

2011-08-09 Thread Anton Shterenlikht
On Tue, Aug 09, 2011 at 02:57:51PM -0500, Antonio Olivares wrote:
> > But if you really, really need to read docx, you can try the web
> > application from Microsoft. A few months ago, I got also a lot of docx
> > and I opened it with the microsoft web app; this worked for me to extract
> > the information...
> >
> > More information:
> > http://office.microsoft.com/en-us/web-apps/
> >
> > The downside: you have to sign up on a microsoft service :(
> >
> 
> Can also use libreoffice.  It is in the ports system :)
> 
> Without installing anything, Google Docs also opens *.docx files, if
> needed. There are other options too, but it depends on what Anton
> wants to install* or just view* & extract?

Well.. I don't really want to install anything
just to read docx. So probably something as
small as possible. libreoffice (even if it's in ports,
which I dearly love) looks like a monster of
a package, so I'm not sure.

Thanks anyway




Re: extracting text from docx files

2011-08-09 Thread Antonio Olivares
> Well.. I don't really want to install anything
> just to read docx. So probably something as
> small as possible. libreoffice (even if it's in ports,
> which I dearly love) looks like a monster of
> a package, so I'm not sure.
>
> Thanks anyway

AbiWord is a word processor that opens docx files, and it is in ports :)
You are welcome to check it out :)  I mentioned LibreOffice because it
is a full suite, but it is BIG :(

It is not a MONSTER :)

Regards,

Antonio


Re: extracting text from docx files

2011-08-09 Thread Warren Block

On Tue, 9 Aug 2011, Anton Shterenlikht wrote:

> Well.. I don't really want to install anything
> just to read docx. So probably something as
> small as possible. libreoffice (even if it's in ports,
> which I dearly love) looks like a monster of
> a package, so I'm not sure.


Although still relatively large, OpenOffice has fewer dependencies than 
LibreOffice.  My system has OO.o 3.3 installed, and 'make missing' shows 
seventeen new dependencies needed by LibreOffice.



Re: extracting text from docx files

2011-08-09 Thread Chris Hill

On Tue, 9 Aug 2011, Anton Shterenlikht wrote:

> On Tue, Aug 09, 2011 at 02:57:51PM -0500, Antonio Olivares wrote:
> > > But if you really, really need to read docx, you can try the web
> > > application from Microsoft. A few months ago, I got also a lot of docx
> > > and I opened it with the microsoft web app; this worked for me to extract
> > > the information...
> > >
> > > More information:
> > > http://office.microsoft.com/en-us/web-apps/
> > >
> > > The downside: you have to sign up on a microsoft service :(
> > >
> >
> > Can also use libreoffice.  It is in the ports system :)
> >
> > Without installing anything, Google Docs also opens *.docx files, if
> > needed. There are other options too, but it depends on what Anton
> > wants to install* or just view* & extract?
>
> Well.. I don't really want to install anything
> just to read docx. So probably something as
> small as possible. libreoffice (even if it's in ports,
> which I dearly love) looks like a monster of
> a package, so I'm not sure.


Maybe an online service? If you don't have too many to convert at one 
time, and there's nothing secret in them, you could try 
http://www.doc2pdf.net/ - I've never used it, so caveat clicktor.


--
Chris Hill   ch...@monochrome.org
** [ Busy Expunging  ]


Re: extracting text from docx files

2011-08-11 Thread Polytropon
On Tue, 9 Aug 2011 21:16:11 +0200, Christian Barthel wrote:
> On Tue, Aug 09, 2011 at 02:36:32PM +0100, Anton Shterenlikht wrote:
> > I often receive information in *.docx format
> > from my MS using colleagues. Sometimes I can
> > ask for a pdf (or similar) instead, but not always.
> 
> You have a lot of nice options: 
> - Force them to use BSD/Linux ;)
> - explain them, why docx is shit!
> - don't read it

I also suggest combining this with reading the following
article:

http://en.nothingisreal.com/wiki/Please_don't_send_me_Microsoft_Word_documents

It's very polite and precise about why using "DOC" files
is generally a bad idea. It can be easily concluded that
it also applies to "DOCX" files.

The document also discusses alternatives.



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...



Re: extracting text from docx files

2011-08-11 Thread Anton Shterenlikht
On Thu, Aug 11, 2011 at 12:14:51PM +0200, Polytropon wrote:
> On Tue, 9 Aug 2011 21:16:11 +0200, Christian Barthel wrote:
> > On Tue, Aug 09, 2011 at 02:36:32PM +0100, Anton Shterenlikht wrote:
> > > I often receive information in *.docx format
> > > from my MS using colleagues. Sometimes I can
> > > ask for a pdf (or similar) instead, but not always.
> > 
> > You have a lot of nice options: 
> > - Force them to use BSD/Linux ;)
> > - explain them, why docx is shit!
> > - don't read it
> 
> I also suggest to combine this with reading the following
> article:
> 
> http://en.nothingisreal.com/wiki/Please_don't_send_me_Microsoft_Word_documents
> 
> It's very polite and precise about why using "DOC" files
> is generally a bad idea. It can be easily concluded that
> it also applies to "DOCX" files.
> 
> The document also discusses alternatives.

That's not my war. It's not going to achieve
much, me telling all our admin and academic
staff that what they were taught throughout
their careers might not be the ideal, or even
the only, tool in the universe.
Sometimes I can request pdf, sometimes I fail.

I also sometimes try to get pdf from various
UK govt departments. Sometimes they only
make documents available in MS formats.
Again, sometimes they respond well, but
mostly, they ignore my requests.

By the way, I tried abiword, and it couldn't
open my docx.



Re: extracting text from docx files

2011-08-11 Thread Ruben de Groot

There are several docx converters online (google docx2pdf). Haven't tried them
though. LibreOffice handles docx quite well.

On Thu, Aug 11, 2011 at 12:22:22PM +0100, Anton Shterenlikht typed:
> On Thu, Aug 11, 2011 at 12:14:51PM +0200, Polytropon wrote:
> > On Tue, 9 Aug 2011 21:16:11 +0200, Christian Barthel wrote:
> > > On Tue, Aug 09, 2011 at 02:36:32PM +0100, Anton Shterenlikht wrote:
> > > > I often receive information in *.docx format
> > > > from my MS using colleagues. Sometimes I can
> > > > ask for a pdf (or similar) instead, but not always.
> > > 
> > > You have a lot of nice options: 
> > > - Force them to use BSD/Linux ;)
> > > - explain them, why docx is shit!
> > > - don't read it
> > 
> > I also suggest to combine this with reading the following
> > article:
> > 
> > http://en.nothingisreal.com/wiki/Please_don't_send_me_Microsoft_Word_documents
> > 
> > It's very polite and precise about why using "DOC" files
> > is generally a bad idea. It can be easily concluded that
> > it also applies to "DOCX" files.
> > 
> > The document also discusses alternatives.
> 
> That's not my war. It's not going to achieve
> much me telling all our admin and academic
> staff that what they were taught throughout
> their career might not be ideal, or even
> not the only, tool in the universe.
> Sometimes I can request pdf, sometimes I fail.
> 
> I also sometimes try to get pdf from various
> UK govt departments. Sometimes they only
> make documents available in MS formats.
> Again, sometimes they respond well, but
> mostly, they ignore my requests.
> 
> By the way, I tried abiword, and it couldn't
> open my docx.