
Re: io module and pdf question

2013-06-26 Thread wxjmfauth

On Tuesday, 25 June 2013 06:18:44 UTC+2, jyou...@kc.rr.com wrote:
> Would like to get your opinion on this.  Currently to get the metadata out of 
> a pdf file, I loop through the guts of the file.  I know it's not the 
> greatest idea to do this, but I'm trying to avoid extra modules, etc.
> 
> 
> 
> Adobe javascript was used to insert the metadata, so the added data looks 
> something like this:
> 
> 
> 
> XYZ:colorList="DarkBlue,Yellow"
> 
> 
> 
> With python 2.7, it successfully loops through the file contents and I'm able 
> to find the line that contains "XYZ:colorList".
> 
> 
> 
> However, when I try to run it with python 3, it errors:
> 
> 
> 
>   File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
> 
> 
> 
> I've done some research on this, and it looks like encoding it to latin-1 
> works.  I also found that if I use the io module, it will work on both python 
> 2.7 and 3.3.  For example:
> 
> 
> 
> --
> import io
> import os
>
> pdfPath = '~/Desktop/test.pdf'
>
> colorlistData = ''
>
> with io.open(os.path.expanduser(pdfPath), 'r', encoding='latin-1') as f:
>     for i in f:
>         if 'XYZ:colorList' in i:
>             colorlistData = i.split('XYZ:colorList')[1]
>             break
>
> print(colorlistData)
> --
> 
> 
> 
> As you can tell, I'm clueless in how exactly this works and am hoping someone 
> can give me some insight on:
> 
> 1. Is there another way to get metadata out of a pdf without having to 
> install another module?
> 
> 2. Is it safe to assume pdf files should always be encoded as latin-1 (when 
> trying to read it this way)?  Is there a chance they could be something else?
> 
> 3. Is the io module a good way to pursue this?
> 
> 
> 
> Thanks for your help!
> 
> 
> 
> Jay

---


Forget latin-1.
There is nothing wrong with trying to get such information
by reading a pdf file in binary mode. What is important
is to know exactly what you are searching for and to
do the work correctly.

A complete example with the pdf file, hypermeta.pdf, I produced
which contains the string "abcé€" as Subject metadata.
pdf version: 1.4
producer: LaTeX with hyperref package
(personal comment: "xdvipdfmx")
Python 3.2

>>> with open('hypermeta.pdf', 'rb') as fo:
...     r = fo.read()
... 
>>> p1 = r.find(b'Subject<')
>>> p1
4516
>>> p2 = r.find(b'>', p1)
>>> p2
4548
>>> rr = r[p1:p2+1]
>>> rr
b'Subject<feff00610062006300e920ac>'
>>> rrr = rr[len(b'Subject<'):-1]
>>> rrr
b'feff00610062006300e920ac'
>>> # decoding the information
>>> rrr = rrr.decode('ascii')
>>> rrr
'feff00610062006300e920ac'
>>> i = 0
>>> a = []
>>> while i < len(rrr):
...     t = rrr[i:i+4]
...     a.append(t)
...     i += 4
... 
>>> a
['feff', '0061', '0062', '0063', '00e9', '20ac']
>>> b = [(int(e, 16) for e in a]
  File "", line 1
b = [(int(e, 16) for e in a]
   ^
SyntaxError: invalid syntax
>>> # oops, error allowed
>>> b = [int(e, 16) for e in a]
>>> b
[65279, 97, 98, 99, 233, 8364]
>>> c = [chr(e) for e in b]
>>> c
['\ufeff', 'a', 'b', 'c', 'é', '€']
>>> # result
>>> d = ''.join(c)
>>> d
'\ufeffabcé€'
>>> d = d[1:]
>>> 
>>> 
>>> d
'abcé€'
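
The manual steps in the session above can be condensed, as a minimal
sketch. It assumes the value is a hex string that starts with the FEFF
byte order mark, as in hypermeta.pdf; other PDFs may store the Subject
differently. The helper names and the sample bytes are hypothetical and
only stand in for a real file.

```python
# Hypothetical helpers, not part of the original session.
def decode_pdf_hex_string(hex_bytes):
    """Decode a hex string such as b'feff0061...' into text.

    Assumes the value starts with the FEFF byte order mark; the 'utf-16'
    codec then selects big-endian order and drops the mark itself.
    """
    raw = bytes.fromhex(hex_bytes.decode('ascii'))
    return raw.decode('utf-16')


def find_subject(data):
    """Return the text between b'Subject<' and the next b'>', or None."""
    marker = b'Subject<'
    start = data.find(marker)
    if start == -1:
        return None
    end = data.find(b'>', start)
    if end == -1:
        return None
    return decode_pdf_hex_string(data[start + len(marker):end])


# Stand-in bytes instead of a real file, so the sketch is self-contained.
sample = b'<< /Subject<feff00610062006300e920ac> >>'
subject = find_subject(sample)
print(subject)
```

Letting the codec handle the byte order mark avoids the explicit
chr()/join() loop and the final d[1:] slice.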


As Christian Gollwitzer pointed out, not all objects in a pdf
are encoded in that way. Do not expect to get the content,
the "text", that way.
When built with Unicode technology, the text of a pdf is
composed with a *unique* set of abstract IDs, constructed with
the help of the Unicode code point table and the properties
of the (OpenType) font used in that pdf; this is equivalent to
the utf8/16/32 transformers in "plain unicode".

Luckily for the crowd, in 2013, there are people (devs) who
understand the coding of characters, Unicode and how
to use it.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: io module and pdf question

2013-06-25 Thread Dave Angel


On 06/25/2013 12:15 PM, jyoun...@kc.rr.com wrote:

> Thank you Rusi and Christian!



Something I don't think was mentioned was that reading a text file in 
Python 3, and specifying latin-1, will work simply because every 
possible 8-bit byte is a character in Latin-1.  That doesn't mean that 
those characters you get have any connection with the real meaning of 
the file.
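
A small sketch of that point, using made-up bytes rather than a real
PDF: Latin-1 accepts every byte value, so decoding never fails, but the
characters it returns need not match what the bytes were meant to be.

```python
# Every byte value 0..255 maps to some Latin-1 character, so this never raises.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
roundtrip = as_text.encode('latin-1')

# The same bytes are rejected when read as UTF-8.
try:
    all_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# "Works" is not "right": the UTF-8 bytes of the euro sign decode to
# three unrelated Latin-1 characters.
euro_bytes = '\u20ac'.encode('utf-8')   # b'\xe2\x82\xac'
mangled = euro_bytes.decode('latin-1')
print(utf8_ok, repr(mangled))
```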



> So it sounds like I should read the pdf data in as binary:
>
> import os
>
> pdfPath = '~/Desktop/test.pdf'
>
> colorlistData = ''
>
> with open(os.path.expanduser(pdfPath), 'rb') as f:
>     for i in f:
>         if 'XYZ:colorList' in i:
>             colorlistData = i.split('XYZ:colorList')[1]
>             break
>
> print(colorlistData)
>
> This gives me the error:
> TypeError: Type str doesn't support the buffer API


That's just a tiny piece of the error.  Post the full traceback, which 
shows the line that fails, and what called it, and so on.  In this case 
I'd guess that the line:

for i in f:

is failing since that mechanism is for reading lines in a text file. 
For reading streams of bytes, you have the read() method, where you 
supply your own count.




> I admit I know nothing about binary, except it's ones and zeroes.  Is there a
> way to read it in as binary, convert it to ascii/unicode,


That makes no sense without knowing what the binary data represents.  It 
MIGHT be that pieces of it will actually be valid ascii, or valid 
unicode (encoded with some encoding).  But you would have to ask the 
author, or look up the spec for that particular binary file format.


I'm not familiar at all with how PDFs are encoded, so I don't know what 
the possibilities are.


One hacky approach is to use the strings utility (standard on most 
versions of Unix/Linux) to basically throw out most of the file, keeping 
only those portions of it that happen to look like reasonable ASCII.  By 
default it captures each consecutive sequence of at least 4 ASCII 
printable characters, and puts a newline to represent one or more 
unprintable or non-ASCII characters between them.


If you cannot find strings (or string) for your OS, you can write the 
filter yourself.
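
A rough sketch of such a filter in Python. It only approximates the
strings utility: it keeps runs of at least four bytes in the range
space to tilde and drops everything else. The function name and the
sample bytes are made up for illustration.

```python
import re


def ascii_strings(data, min_len=4):
    """Return the runs of printable ASCII (space to tilde) found in data."""
    pattern = re.compile(('[\\x20-\\x7e]{%d,}' % min_len).encode('ascii'))
    return [run.decode('ascii') for run in pattern.findall(data)]


# Made-up bytes: two readable runs, some binary noise, and a run that is too short.
sample = b'\x00\x01XYZ:colorList="DarkBlue,Yellow"\xe2\xe3\xcf\xd3abc\x00%PDF-1.4\n'
found = ascii_strings(sample)
print(found)
```

Each returned item plays the role of one output line of strings, so the
metadata can then be searched with ordinary text operations.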


But much better would be to use some library that understood the PDF 
format rules.



> and then somehow split it by newline characters so that I can pull the appropriate
> metadata lines out?  For example, XYZ:colorList="DarkBlue,Yellow"
>
> Thanks!
>
> Jay
>
> --
>
> > Most of the PDF objects are therefore not encoded. It is, however,
> > possible to include a PDF into another PDF and to encode it, but that's
> > a rare case. Therefore the metadata can usually be read in text mode.
> > However, to correctly find all objects, the xref-table indexes offsets
> > into the PDF. It must be treated binary in any case, and that's the
> > funny reason for the first 3 characters of the PDF - they must include
> > characters with the 8th bit set, such that FTP applications treat it as
> > binary.
>
> >   Christian




--
DaveA

Re: io module and pdf question

2013-06-25 Thread MRAB


On 25/06/2013 17:15, jyoun...@kc.rr.com wrote:

> Thank you Rusi and Christian!
>
> So it sounds like I should read the pdf data in as binary:
>
> import os
>
> pdfPath = '~/Desktop/test.pdf'
>
> colorlistData = ''
>
> with open(os.path.expanduser(pdfPath), 'rb') as f:
>     for i in f:
>         if 'XYZ:colorList' in i:
>             colorlistData = i.split('XYZ:colorList')[1]
>             break
>
> print(colorlistData)
>
> This gives me the error:
> TypeError: Type str doesn't support the buffer API
>
> I admit I know nothing about binary, except it's ones and zeroes.  Is there a
> way to read it in as binary, convert it to ascii/unicode, and then somehow
> split it by newline characters so that I can pull the appropriate metadata
> lines out?  For example, XYZ:colorList="DarkBlue,Yellow"


In Python 2, string literals like '' are by default bytestrings. If you
want a Unicode string you need to add the prefix u, so u''.

In Python 3, string literals like '' are by default Unicode. If you
want a bytestring you need to add the prefix b, so b''.

Python 2 was lax when mixing bytestrings with Unicode strings.

Python 3, on the other hand, insists that you know the difference: is
it text (Unicode) or binary data (bytestring)?
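
Applied to the code quoted above, a minimal sketch: keep the file in
binary mode and make the search key a bytestring too. In the quoted
code the TypeError most likely comes from testing a str against the
bytes lines that a binary file yields. io.BytesIO stands in for
open(path, 'rb') here, and the sample contents are invented.

```python
import io


def find_color_list(stream, key=b'XYZ:colorList'):
    """Return whatever follows key on the first line containing it, else b''."""
    for line in stream:          # a binary stream yields bytes lines split on b'\n'
        if key in line:          # bytes searched in bytes, no str involved
            return line.split(key, 1)[1].strip()
    return b''


# Invented stand-in for a PDF opened with open(path, 'rb').
fake_pdf = io.BytesIO(
    b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\nXYZ:colorList="DarkBlue,Yellow"\n%%EOF\n'
)
value = find_color_list(fake_pdf)
print(value.decode('ascii'))
```

The result stays a bytestring until it is decoded explicitly, which is
only safe once the encoding of that particular value is known.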


> Thanks!
>
> Jay
>
> --
>
> > Most of the PDF objects are therefore not encoded. It is, however,
> > possible to include a PDF into another PDF and to encode it, but that's
> > a rare case. Therefore the metadata can usually be read in text mode.
> > However, to correctly find all objects, the xref-table indexes offsets
> > into the PDF. It must be treated binary in any case, and that's the
> > funny reason for the first 3 characters of the PDF - they must include
> > characters with the 8th bit set, such that FTP applications treat it as
> > binary.
>
> >   Christian





Re: io module and pdf question

2013-06-25 Thread rusi

I guess the string constant 'XYZ:colorList' needs to be a byte-string -- use b 
prefix?

Dunno for sure. Black hole for me -- unicode!

RE: io module and pdf question

2013-06-25 Thread jyoung79

Thank you Rusi and Christian!

So it sounds like I should read the pdf data in as binary:


import os

pdfPath = '~/Desktop/test.pdf'

colorlistData = ''

with open(os.path.expanduser(pdfPath), 'rb') as f:
    for i in f:
        if 'XYZ:colorList' in i:
            colorlistData = i.split('XYZ:colorList')[1]
            break

print(colorlistData)


This gives me the error:
TypeError: Type str doesn't support the buffer API

I admit I know nothing about binary, except it's ones and zeroes.  Is there a 
way to read it in as binary, convert it to ascii/unicode, and then somehow 
split it by newline characters so that I can pull the appropriate metadata 
lines out?  For example, XYZ:colorList="DarkBlue,Yellow"

Thanks!

Jay

--

> Most of the PDF objects are therefore not encoded. It is, however, 
> possible to include a PDF into another PDF and to encode it, but that's 
> a rare case. Therefore the metadata can usually be read in text mode. 
> However, to correctly find all objects, the xref-table indexes offsets 
> into the PDF. It must be treated binary in any case, and that's the 
> funny reason for the first 3 characters of the PDF - they must include 
> characters with the 8th bit set, such that FTP applications treat it as 
> binary.

>   Christian

Re: io module and pdf question

2013-06-25 Thread Christian Gollwitzer


On 25.06.13 08:33, rusi wrote:

> On Tuesday, June 25, 2013 9:48:44 AM UTC+5:30, jyou...@kc.rr.com
> wrote:
> > 1. Is there another way to get metadata out of a pdf without having
> > to install another module?
> > 2. Is it safe to assume pdf files should
> > always be encoded as latin-1 (when trying to read it this way)?  Is
> > there a chance they could be something else?
>
> If your code is binary open in binary mode (mode="rb") rather than
> choosing a bogus encoding. You then have to make your strings also
> binary (b-prefix)
> Also I am surprised that it works at all.  Most
> pdfs are compressed I thought??



PDFs are a binary format, yes. But they are not, as a whole, compressed. 
They are made up of "objects", which can be referred to and crosslinked, 
and these objects are represented as ASCII strings. Some of these can 
specify a "decoding filter". Examples for the filter include zlib 
compression (/FlateDecode) and jpeg compression (/DCTDecode).


Most of the PDF objects are therefore not encoded. It is, however, 
possible to include a PDF into another PDF and to encode it, but that's 
a rare case. Therefore the metadata can usually be read in text mode. 
However, to correctly find all objects, the xref-table indexes offsets 
into the PDF. It must be treated binary in any case, and that's the 
funny reason for the first 3 characters of the PDF - they must include 
characters with the 8th bit set, such that FTP applications treat it as 
binary.
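
A small sketch of why a plain search only finds data that is stored
without a filter. It uses the zlib module on invented bytes, not a real
PDF stream, to mimic what a /FlateDecode filter does to its contents.

```python
import zlib

needle = b'XYZ:colorList'
# Invented payload standing in for the contents of one PDF stream.
payload = b'XYZ:colorList="DarkBlue,Yellow"\n' * 20

compressed = zlib.compress(payload)      # roughly what a /FlateDecode stream holds
found_in_compressed = needle in compressed

restored = zlib.decompress(compressed)   # what a PDF reader does before use
found_after_inflate = needle in restored

print(found_in_compressed, found_after_inflate)
```

So a byte search works for uncompressed objects such as the metadata
discussed in this thread, but it says nothing about data that sits
inside a filtered stream.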


Christian

Re: io module and pdf question

2013-06-24 Thread rusi

On Tuesday, June 25, 2013 9:48:44 AM UTC+5:30, jyou...@kc.rr.com wrote:
> 1. Is there another way to get metadata out of a pdf without having to 
> install another module?
> 2. Is it safe to assume pdf files should always be encoded as latin-1 (when 
> trying to read it this way)?  Is there a chance they could be something else?

If your code is binary open in binary mode (mode="rb") rather than choosing a 
bogus encoding. You then have to make your strings also binary (b-prefix)
Also I am surprised that it works at all.  Most pdfs are compressed I thought??

> 3. Is the io module a good way to pursue this?

The docs say:
> The io module provides the Python interfaces to stream handling. Under Python 
> 2.x, this is proposed as an alternative to the built-in file object, but in 
> Python 3.x it is the default interface to access files and streams.

So I guess no point using io for python 3??
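
A quick check of that reading, as a sketch for Python 3 only: there
io.open and the built-in open are the same function, so importing io
just for open adds nothing. Under Python 2.7 the two differ, which is
why io.open is handy for code that must run on both.

```python
import io
import sys

# On Python 3, io.open is simply another name for the built-in open().
same_function = io.open is open
print(sys.version_info[0], same_function)
```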

