Re: imaplib: is this really so unwieldy?

2021-05-30 Thread boB Stepp
On Sun, May 30, 2021 at 1:04 AM hw  wrote:
>
> On 5/28/21 2:36 AM, boB Stepp wrote:

> > 
> > Just as SMTP is the protocol for sending email, the Internet Message
> > Access Protocol (IMAP) specifies how to communicate with an email
> > provider’s server to retrieve emails sent to your email address.
> > Python comes with an imaplib module, but in fact the third-party
> > imapclient module is easier to use. This chapter provides an
> > introduction to using IMAPClient; the full documentation is at
> > http://imapclient.readthedocs.org/.
> >
> > The imapclient module downloads emails from an IMAP server in a rather
> > complicated format. Most likely, you’ll want to convert them from this
> > format into simple string values. The pyzmail module does the hard job
> > of parsing these email messages for you. You can find the complete
> > documentation for PyzMail at http://www.magiksys.net/pyzmail/.
> >
> > Install imapclient and pyzmail from a Terminal window. Appendix A has
> > steps on how to install third-party modules.
> > 

> I don't know which imaplib the author uses; the imaplib I found
> definitely doesn't give uids of the messages, contrary to the example
> he's giving.

Look at the above three paragraphs quoted from my original response.
The author is using *imapclient* and *pyzmail* as the author
indicates.

boB Stepp
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: imaplib: is this really so unwieldy?

2021-05-30 Thread hw

On 5/28/21 2:36 AM, boB Stepp wrote:

On Thu, May 27, 2021 at 6:22 PM Cameron Simpson  wrote:


On 27May2021 18:42, hw  wrote:



So it seems that IMAP support through python is virtually non-existent.


This still surprises me, but I've not tried to use IMAP seriously. I
read email locally, and collect it with POP instead. With a tool I wrote
myself in Python, as it happens.


I am out of my league here, but what I found in one of my books might
be helpful.  Al Sweigart wrote a useful book, "Automate the Boring
Stuff in Python".  In chapter 16 he considers email.  In the "IMAP"
section he states:


Just as SMTP is the protocol for sending email, the Internet Message
Access Protocol (IMAP) specifies how to communicate with an email
provider’s server to retrieve emails sent to your email address.
Python comes with an imaplib module, but in fact the third-party
imapclient module is easier to use. This chapter provides an
introduction to using IMAPClient; the full documentation is at
http://imapclient.readthedocs.org/.

The imapclient module downloads emails from an IMAP server in a rather
complicated format. Most likely, you’ll want to convert them from this
format into simple string values. The pyzmail module does the hard job
of parsing these email messages for you. You can find the complete
documentation for PyzMail at http://www.magiksys.net/pyzmail/.

Install imapclient and pyzmail from a Terminal window. Appendix A has
steps on how to install third-party modules.


In the next little section he shows how to retrieve and delete emails
with IMAP using the two third-party tools mentioned above.  And of
course there is more.  Apparently this book is now in its second
edition.  The first edition is available online for free.  The link to
chapter 16 which discusses email is:
https://automatetheboringstuff.com/chapter16/  Hopefully this will
prove helpful to the OP.


Thanks for the pointer!

I don't know which imaplib the author uses; the imaplib I found 
definitely doesn't give uids of the messages, contrary to the example 
he's giving.



Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-28 Thread jak

Il 27/05/2021 05:54, Cameron Simpson ha scritto:

On 26May2021 12:11, Jon Ribbens  wrote:

On 2021-05-26, Alan Gauld  wrote:

I confess I had just assumed the unicode strings were stored
in native unicode UTF8 format.


If you do that then indexing and slicing strings becomes very slow.


True, but that isn't necessarily a show stopper. My impression, on
reflection, is that most slicing is close to the beginning or end of a
string, and that _most strings are small. (Alan has exceptions at least
to the latter.) In those circumstances, the cost of slicing a variable
width encoding is greatly mitigated.

Indexing is probably more general (in my subjective hand waving
guesstimation). But... how common is indexing into large strings?
Versus, say, iteration over a large string?

I was surprised when getting introduced to Golang a few years ago that
it stores all Strings as UTF8 byte sequences. And when writing Go code,
I found very few circumstances where that would actually bring
performance issues, which I attribute in part to my suggestions above
about when, in practical terms, we slice and index strings.

If the internal storage is UTF8, then in an ecosystem where all, or
most, text files are themselves UTF8 then reading a text file has zero
decoding cost - you can just read the bytes and store them! And to write
a String out to a UTF8 file, you just copy the bytes - zero encoding!




Also, UTF8 is a funny thing - it is deliberately designed so that you
can just jump into the middle of an arbitrary stream of UTF8 bytes and
find the character boundaries. That doesn't solve slicing/indexing in
general, but it does avoid any risk of producing mojibake just by
starting your decode at a random place.



Perhaps you are referring to what the Python language does if you jump
to an arbitrary position in a utf8 string. Otherwise, before you start
decoding, you should align to the beginning of a utf8 character by
discarding the bytes that meet the following test:


(byte & 0xc0) == 0x80 /* Clang */
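
The same test can be written in Python. This is only a minimal sketch: the
function name align_to_char_start and the sample string are made up for
illustration. It skips continuation bytes (bit pattern 10xxxxxx) until it
reaches a byte that can start a character.

```python
def align_to_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "héllo".encode("utf-8")      # 'é' is the two bytes C3 A9

start = align_to_char_start(data, 2)   # index 2 is the continuation byte A9
tail = data[start:].decode("utf-8")    # decoding from an aligned offset is safe
```

Starting the decode at index 2 directly would raise UnicodeDecodeError;
after alignment the remaining bytes decode cleanly.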



Cheers,
Cameron Simpson 





Re: imaplib: is this really so unwieldy?

2021-05-27 Thread boB Stepp
On Thu, May 27, 2021 at 6:22 PM Cameron Simpson  wrote:
>
> On 27May2021 18:42, hw  wrote:

> >So it seems that IMAP support through python is virtually non-existent.
>
> This still surprises me, but I've not tried to use IMAP seriously. I
> read email locally, and collect it with POP instead. With a tool I wrote
> myself in Python, as it happens.

I am out of my league here, but what I found in one of my books might
be helpful.  Al Sweigart wrote a useful book, "Automate the Boring
Stuff in Python".  In chapter 16 he considers email.  In the "IMAP"
section he states:


Just as SMTP is the protocol for sending email, the Internet Message
Access Protocol (IMAP) specifies how to communicate with an email
provider’s server to retrieve emails sent to your email address.
Python comes with an imaplib module, but in fact the third-party
imapclient module is easier to use. This chapter provides an
introduction to using IMAPClient; the full documentation is at
http://imapclient.readthedocs.org/.

The imapclient module downloads emails from an IMAP server in a rather
complicated format. Most likely, you’ll want to convert them from this
format into simple string values. The pyzmail module does the hard job
of parsing these email messages for you. You can find the complete
documentation for PyzMail at http://www.magiksys.net/pyzmail/.

Install imapclient and pyzmail from a Terminal window. Appendix A has
steps on how to install third-party modules.


In the next little section he shows how to retrieve and delete emails
with IMAP using the two third-party tools mentioned above.  And of
course there is more.  Apparently this book is now in its second
edition.  The first edition is available online for free.  The link to
chapter 16 which discusses email is:
https://automatetheboringstuff.com/chapter16/  Hopefully this will
prove helpful to the OP.

HTH!
boB Stepp


Re: imaplib: is this really so unwieldy?

2021-05-27 Thread Cameron Simpson
On 27May2021 18:42, hw  wrote:
>On 5/26/21 12:25 AM, Cameron Simpson wrote:
>>On 25May2021 19:21, hw  wrote:
>>>On 5/25/21 11:38 AM, Cameron Simpson wrote:
On 25May2021 10:23, hw  wrote:
>>You'd be surprised how useful it is to make almost any standalone
>>programme a module like this - in the medium term it almost always pays
>>off for me. Even just the discipline of shoving all the formerly-global
>>variables in the main function brings lower-bugs benefits.
>
>What do you do with it when importing it?  Do you somehow design your 
>programs as modules in some way that makes them usable as some kind of 
>library function?

Michael Torrie addressed this. In short, inspecting __name__ tells you 
if your module was run as a main programme or imported. That's what the 
test for '__main__' checks.

So modules like this tend to be:

    ... module code ...

    if __name__ == '__main__':
        ... run as a command line main programme ...

And I personally spell that last line as:

        sys.exit(main(sys.argv))

and define a main() function near the top of the module.

>>Chris has addressed this. msgnums is a list of the data components of the
>>IMAP response.  By going str(msgnums) you're not getting "the message
>>numbers as text" you're getting what printing a list prints. Which is
>>roughly Python code: the brackets and the repr() of each list member.
>
>Well, of course I expect to get message numbers returned from such a 
>library function, not the raw imap response.  What is the library for 
>when I have to figure it all out by myself anyway.

The library, for this particular library:
- conceals the async side, giving you plain request=>response function 
  calls
- parses the response in its generic structural form and hands you the 
  components

It's a relatively thin shim for the IMAP protocol itself, not a fully 
fledged mailbox parser.

Since the search function can return many many weird and wonderful 
things depending what you ask it for, this particular library (a) does 
not parse the search field parameter ('(UID)') or the responses. It 
passes them through and (b) correspondingly does not interpret the 
result beyond unpacking the IMAP response into the data packets 
provided.

What you really want is a "search_uids()" function, which calls 
.search() with '(UID)' _and_ parses the particular style of response 
that '(UID)' produces from a server. imaplib doesn't have that, and that 
is a usability deficiency. It could do with a small suite of 
convenience/helper functions that call the core IMAP methods in 
particular commonly used ways.
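
A convenience helper of that kind might look like the sketch below. The
names parse_uid_list and search_uids are hypothetical, not part of imaplib.
The sketch also swaps in imaplib's IMAP4.uid('SEARCH', ...) call rather than
.search() with '(UID)', and assumes the server answers with a single
whitespace-separated data component as in the examples in this thread.

```python
def parse_uid_list(data):
    """Turn the data part of a SEARCH response into a list of ints."""
    uids = []
    for chunk in data:
        # An empty result may arrive as [b''] or [None]; skip those.
        if chunk:
            uids.extend(int(token) for token in chunk.split())
    return uids


def search_uids(conn, *criteria):
    """Hypothetical helper: run UID SEARCH on an imaplib.IMAP4 connection.

    With no criteria given it searches for ALL messages.
    """
    typ, data = conn.uid('SEARCH', None, *(criteria or ('ALL',)))
    if typ != 'OK':
        raise RuntimeError(f'UID SEARCH failed: {typ} {data!r}')
    return parse_uid_list(data)
```

With a logged-in connection and a selected mailbox, search_uids(conn) would
return something like a plain list of integers, and the caller never has to
look at the raw response.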

>>Notice that the example code accessed msgnums[0] - that is the first
>>data component, a bytes. That you _can_ convert to a string (under
>>assumptions about the encoding).
>
>And of course, I don't want to randomly convert bytes into strings ...

Well, you don't need to. Splitting on "whitespace" is enough - you then 
have a list of individual UID bytes objects, which can probably be 
passed around opaquely - you don't ordinarily need to _care_ that 
they're bytes transcriptions of the server numeric UID _values_ - you 
can at that point just treat them as tokens.

>>By getting the "str" form of a list, you're forced into the weird [3:-2]
>>hack to trim the ends. But it is just a hack for a transcription
>>mistake, not a sane parse.
>
>Right, that's why I don't like it and is part of what makes it so unwieldy.

Bear with me - I'm going to be quite elaborate here.

That _particular_ problem is because you transcribed a 
list-of-bytes-objects as a string (as Python would have printed such a 
thing with the print() function). It's just the wrong thing to do here, 
regardless of the language.

Consider (at the Python interactive ">>> " prompt):

>>> nums = [1, 2, 3]

A list of ints. print(nums) outputs:

>>> print(nums)
[1, 2, 3]

print() works by writing out str() of each of its arguments. So the 
above is a string written to the output. Looking at str():

>>> str(nums)
'[1, 2, 3]'

It's a string, with a leading '[' character, a decimal transcription of 
the numeric values with ', ' between those transcriptions, and a 
trailing ']' character.

The single quotes above are the interactive interpreter printing that 
Python str value using repr(), as you would type it to a Python 
programme.

To do what you did with the msgnums I'd go:

str(nums)[1:-1].split(', ')

which would get me a list of str values. But that cumbersomeness is 
because of the str(nums), which turned nums into a string transcription 
of the list of values. Let's do the equivalent with a made up IMAP 
search response:

msgnums = [b'123 456 789']

The above, I gather, is the source of your [3:-2] thing: trim the "[b'" 
and "']" markers.
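
The two routes can be put side by side. This is a small sketch using the
made-up response from above; the variable names hacked, tokens and as_text
are chosen only for illustration.

```python
msgnums = [b'123 456 789']            # made-up SEARCH response data

# The hack: stringify the whole list, then trim "[b'" and "']".
hacked = str(msgnums)[3:-2].split()

# The direct route: take the first data component and split the bytes.
tokens = msgnums[0].split()

# Decode only if str values are really wanted; ASCII is an assumption
# that holds for message numbers and UIDs.
as_text = [token.decode('ascii') for token in tokens]
```

Both end up with the same three numbers, but only the second route is
parsing the data rather than its printed representation.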

I'm using msgnums here because you named it that way, but the response 
from a search() is actually a list of data components. For 
search('(UID)') that list has one element, the bytes data component 
with UIDs transcribed within 

Re: imaplib: is this really so unwieldy?

2021-05-27 Thread hw

On 5/25/21 3:55 PM, Grant Edwards wrote:

On 2021-05-25, hw  wrote:


I'm about to do stuff with emails on an IMAP server and wrote a program
using imaplib


My recollection of using imaplib a few years ago is that yes, it is
unwieldy, oddly low-level, and rather un-Pythonic (excuse my
presumption in declaring what is and isn't "Pythonic").


It's good to know that it's not me and that I simply made a bad pick for 
something to learn python with :)



I switched to using imaplib2 and found it much easier to use. It's a
higher-level wrapper for imaplib.

I think this is the currently maintained fork:

   https://github.com/jazzband/imaplib2

I haven't actively used either for several years, so things may have
changed...

--
Grant





Re: imaplib: is this really so unwieldy?

2021-05-27 Thread Michael Torrie
On 5/27/21 10:42 AM, hw wrote:
> What do you do with it when importing it?  Do you somehow design your 
> programs as modules in some way that makes them usable as some kind of 
> library function?

Yes, precisely.  Typically I break up my python projects into logical
modules, which are each kind of like a library in their own right.  The
if __name__=="__main__" idiom is really handy for this. It lets me build
testing into each module. If I run the module directly, it can execute
tests on the functions in that module. If it's imported, the code inside
the if __name__=="__main__" block is not executed.  Sometimes I'll have
a python file that can run standalone, using command-line arguments, or
be imported by something else and used that way.  All depends on my needs at
the moment, but this mechanism is very powerful.

Note that any py file is loaded and executed when imported.  So any
module-level initialization code gets run whether a file is imported or
run directly (which is why the if __name__=="__main__" idiom works).
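
To see both behaviours from one script, here is a sketch that writes a tiny
throwaway module (the name demo_mod and its contents are invented for this
example), imports it, and then runs the same file as a main programme using
the standard runpy module.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

SOURCE = '''
events = ["module-level code ran as " + __name__]

if __name__ == "__main__":
    events.append("main block ran")
'''


def demo():
    """Load the same file once as an import and once as a script."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_mod.py"
        path.write_text(SOURCE)

        # Imported: __name__ is the module name, the main block is skipped.
        spec = importlib.util.spec_from_file_location("demo_mod", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        imported = list(module.events)

        # Run directly: __name__ is "__main__", so the main block executes.
        namespace = runpy.run_path(str(path), run_name="__main__")
        as_script = list(namespace["events"])
    return imported, as_script


imported, as_script = demo()
```

In both cases the module-level line runs; only the run-as-script case also
records the entry from inside the if block.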


Re: imaplib: is this really so unwieldy?

2021-05-27 Thread Chris Angelico
On Fri, May 28, 2021 at 4:04 AM Peter J. Holzer  wrote:
>
> On 2021-05-26 08:34:28 +1000, Chris Angelico wrote:
> > Yes, any given string has a single width, which makes indexing fast.
> > The memory cost you're describing can happen, but apart from a BOM
> > widening an otherwise-ASCII string to 16-bit, there aren't many cases
> > where you'll get a single wide character in a narrow string.
>
> A single emoji in a long English text.
>
> > Usually, if there are any wide characters, there'll be a good number
> > of them
>
> Oh, right. People who use emoji usually use a lot of them .
>
>

Exactly :) I can easily imagine a short block of text with just one
(say, a single email), but if you have a gigabyte or even a couple
hundred meg of text, the chances of *a single* emoji become
vanishingly slim.
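
For the curious, the effect of that one emoji can be measured. This is a
rough sketch that assumes CPython 3.3 or later (other implementations may
store strings differently); sys.getsizeof reports the size of the string
object including its header.

```python
import sys

plain = 'a' * 100_000
with_emoji = plain + '\U0001F600'     # one non-BMP character appended

plain_size = sys.getsizeof(plain)
emoji_size = sys.getsizeof(with_emoji)

# On CPython the second string uses four bytes per character throughout,
# so it is roughly four times as large as the pure-ASCII one.
ratio = emoji_size / plain_size
```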

ChrisA


Re: imaplib: is this really so unwieldy?

2021-05-27 Thread Peter J. Holzer
On 2021-05-26 08:34:28 +1000, Chris Angelico wrote:
> Yes, any given string has a single width, which makes indexing fast.
> The memory cost you're describing can happen, but apart from a BOM
> widening an otherwise-ASCII string to 16-bit, there aren't many cases
> where you'll get a single wide character in a narrow string.

A single emoji in a long English text.

> Usually, if there are any wide characters, there'll be a good number
> of them

Oh, right. People who use emoji usually use a lot of them .

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: imaplib: is this really so unwieldy?

2021-05-27 Thread hw

On 5/26/21 12:25 AM, Cameron Simpson wrote:

On 25May2021 19:21, hw  wrote:

On 5/25/21 11:38 AM, Cameron Simpson wrote:

On 25May2021 10:23, hw  wrote:

    if status != 'OK':
        print('Login failed')
        exit


Your "exit" won't do what you want. I expect this code to raise a
NameError exception here (you've not defined "exit"). That _will_ abort
the programme, but in a manner indicating that you've used an unknown
name.  You probably want:

 sys.exit(1)

You'll need to import "sys".


Oh ok, it seemed to be fine.  Would it be the right way to do it with
sys.exit()?  Having to import another library just to end a program
might not be ideal.


To end a programme early, yes. (sys.exit() actually just raises a
particular exception, BTW.)
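
That exception is SystemExit, and it can be caught like any other. A minimal
sketch, with a made-up main() and a helper named run() that exists only for
this illustration:

```python
import sys


def main(argv):
    """Hypothetical main programme: returns a status instead of exiting."""
    if len(argv) < 2:
        print('usage: prog MAILBOX', file=sys.stderr)
        return 1
    return 0


def run(argv):
    """Call sys.exit() the way a script would, but catch the exception."""
    try:
        sys.exit(main(argv))
    except SystemExit as exc:
        # exc.code carries whatever was passed to sys.exit().
        return exc.code
```

As an aside, a default CPython start-up installs a callable named "exit"
via the site module, so a bare "exit" without parentheses just evaluates
that name and does nothing, rather than ending the programme.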

I usually write a distinct main function, so in that case one can just
"return". After all, what seems an end-of-life circumstance in a
standalone script like yours is just an "end this function" circumstance
when viewed as a function, and that also lets you _call_ the main
programme from some outer thing. Wouldn't want that outer thing
cancelled, if it exists.

My usual boilerplate for a module with a main programme looks like this:

 import sys
 ...
 def main(argv):
     ... main programme, return like any other function ...
 ... other code for the module - functions, classes etc ...
 if __name__ == '__main__':
     sys.exit(main(sys.argv))

which (a) puts main() up the top where it can be seen, (b) makes main()
an ordinary function like any other (c) lets me just import that module
elsewhere and (d) no globals - everything's local to main().

The __name__ boilerplate at the bottom is the magic which figures out if
the module was imported (__name__ will be the import module name) or
invoked from the command line like:

 python -m my_module cmd-line-args...

in which case __name__ has the special value '__main__'. A historic
mechanism which you will convince nobody to change.


Thanks, that seems like good advice.


You'd be surprised how useful it is to make almost any standalone
programme a module like this - in the medium term it almost always pays
off for me. Even just the discipline of shoving all the formerly-global
variables in the main function brings lower-bugs benefits.


What do you do with it when importing it?  Do you somehow design your 
programs as modules in some way that makes them usable as some kind of 
library function?



I've done little with IMAP. What's in msgnums here? Eg:
 print(type(msgnums), repr(msgnums))
just so we all know what we're dealing with here.


 <class 'list'> [b'']


message_uuids = []
for number in str(msgnums)[3:-2].split():


This is very strange. [...]

Yes, and I don't understand it.  'print(msgnums)' prints:

[b'']

when there are no messages and

[b'1 2 3 4 5']


Chris has addressed this. msgnums is a list of the data components of the
IMAP response.  By going str(msgnums) you're not getting "the message
numbers as text" you're getting what printing a list prints. Which is
roughly Python code: the brackets and the repr() of each list member.


Well, of course I expect to get message numbers returned from such a 
library function, not the raw imap response.  What is the library for 
when I have to figure it all out by myself anyway.



Notice that the example code accessed msgnums[0] - that is the first
data component, a bytes. That you _can_ convert to a string (under
assumptions about the encoding).


And of course, I don't want to randomly convert bytes into strings ...


By getting the "str" form of a list, you're forced into the weird [3:-2]
hack to trim the ends. But it is just a hack for a transcription
mistake, not a sane parse.


Right, that's why I don't like it and is part of what makes it so unwieldy.


So I was guessing that it might be an array containing a single
string and that referring to the first element of the array turns into
a string with which split() can be used.  But 'print(msgnums[0].split())'
prints

[b'1', b'2', b'3', b'4', b'5']


msgnums[0] is bytes. You can do most str things with bytes (because that
was found to be often useful) but you get bytes back from those
operations as you'd hope.
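
A short sketch of what that means in practice, using a made-up data
component. Each b'1' is the ASCII text of a number, not a binary value, so
turning it into an int is an explicit step:

```python
component = b'1 2 3 4 5'               # made-up first data component

parts = component.split()              # bytes in, list of bytes out
numbers = [int(part.decode('ascii')) for part in parts]

# Indexing a bytes object yields the integer value of that one byte,
# here the ASCII code of the character '1'.
first_byte = component[0]
```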


As someone unfamiliar with python, I was wondering what this output 
means.  It could be items of an array that are bytes containing numbers, 
like, in binary, 0001, 0010, and so on.  That's more like what I 
would expect and not something I would want to convert into a string.



so I can only guess what that's supposed to mean: maybe an array of
many bytes?  The documentation[1] clearly says: "The message_set
options to commands below is a string [...]"


But that is the parameter to the _call_: your '(UID)' parameter.


No, it's not a uid.  With this library, I haven't found a way to get uids. 
The function to call requires a string, not bytes or an array of bytes.


The included example contradicts the documentation and leaves potential 
users of the library to guessing.



I 

Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-27 Thread Chris Angelico
On Thu, May 27, 2021 at 1:56 PM Cameron Simpson  wrote:
>
> On 26May2021 12:11, Jon Ribbens  wrote:
> >On 2021-05-26, Alan Gauld  wrote:
> >> I confess I had just assumed the unicode strings were stored
> >> in native unicode UTF8 format.
> >
> >If you do that then indexing and slicing strings becomes very slow.
>
> True, but that isn't necessarily a show stopper. My impression, on
> reflection, is that most slicing is close to the beginning or end of a
> string, and that _most strings are small. (Alan has exceptions at least
> to the latter.) In those circumstances, the cost of slicing a variable
> width encoding is greatly mitigated.
>
> Indexing is probably more general (in my subjective hand waving
> guesstimation). But... how common is indexing into large strings?
> Versus, say, iteration over a large string?

Common enough that, when all this was originally discussed, O(1)
indexing and slicing was mandated. It wasn't until MicroPython came
along that it was even entertained as a possibility that O(n) slicing
could be reasonable.

> I was surprised when getting introduced to Golang a few years ago that
> it stores all Strings as UTF8 byte sequences. And when writing Go code,
> I found very few circumstances where that would actually bring
> performance issues, which I attribute in part to my suggestions above
> about when, in practical terms, we slice and index strings.
>
> If the internal storage is UTF8, then in an ecosystem where all, or
> most, text files are themselves UTF8 then reading a text file has zero
> decoding cost - you can just read the bytes and store them! And to write
> a String out to a UTF8 file, you just copy the bytes - zero encoding!

True. IF everything is indeed in the same encoding.

> Also, UTF8 is a funny thing - it is deliberately designed so that you
> can just jump into the middle of an arbitrary stream of UTF8 bytes and
> find the character boundaries. That doesn't solve slicing/indexing in
> general, but it does avoid any risk of producing mojibake just by
> starting your decode at a random place.

Yes, that's true, you can avoid mojibake. But you still can't easily
say "which is the 505005th character". The only way for it to work is
to have some kind of string reference type that carries both the
character index and the byte position, and is capable of arithmetic;
and now we're into the messes of pointer manipulation. Whichever way
you do it, you're just moving the mess around.
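
To make the cost concrete, here is a sketch of finding the Nth character in
UTF-8 bytes. The function name byte_offset_of_char is invented for this
example; the point is that without extra bookkeeping the lookup has to scan
from the start, so it is O(n) rather than O(1).

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset where character number `index` starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Any byte that is not a continuation byte starts a new character.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError('character index out of range')


data = 'aé€\U0001F600z'.encode('utf-8')   # 1 + 2 + 3 + 4 + 1 bytes
offset = byte_offset_of_char(data, 4)       # where 'z' starts
```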

ChrisA


Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Cameron Simpson
On 26May2021 12:11, Jon Ribbens  wrote:
>On 2021-05-26, Alan Gauld  wrote:
>> I confess I had just assumed the unicode strings were stored
>> in native unicode UTF8 format.
>
>If you do that then indexing and slicing strings becomes very slow.

True, but that isn't necessarily a show stopper. My impression, on 
reflection, is that most slicing is close to the beginning or end of a 
string, and that _most strings are small. (Alan has exceptions at least 
to the latter.) In those circumstances, the cost of slicing a variable 
width encoding is greatly mitigated.

Indexing is probably more general (in my subjective hand waving 
guesstimation). But... how common is indexing into large strings?  
Versus, say, iteration over a large string?

I was surprised when getting introduced to Golang a few years ago that 
it stores all Strings as UTF8 byte sequences. And when writing Go code, 
I found very few circumstances where that would actually bring 
performance issues, which I attribute in part to my suggestions above 
about when, in practical terms, we slice and index strings.

If the internal storage is UTF8, then in an ecosystem where all, or 
most, text files are themselves UTF8 then reading a text file has zero 
decoding cost - you can just read the bytes and store them! And to write 
a String out to a UTF8 file, you just copy the bytes - zero encoding!

Also, UTF8 is a funny thing - it is deliberately designed so that you 
can just jump into the middle of an arbitrary stream of UTF8 bytes and 
find the character boundaries. That doesn't solve slicing/indexing in 
general, but it does avoid any risk of producing mojibake just by 
starting your decode at a random place.

Cheers,
Cameron Simpson 


Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Alan Gauld via Python-list
On 26/05/2021 22:15, Tim Chase wrote:

> If you don't decode it upon reading it in, it should still be 100MB
> because it's a stream of encoded bytes.  

I usually convert them to utf8.

> You don't specify what you then do with this humongous string, 

Mainly I search for regex patterns which can span multiple lines.
I could chunk it up if memory was an issue but a single read is
just more convenient. Up until now it hasn't been an issue and
to be honest I don't often hit multi-byte characters so mostly
it will be single byte character strings.

They are mostly research papers and such from my university days
written on a Commodore PET and various early DOS computers with
weird long-lost word processors. Over the years they've been
exported/converted/reimported and then re-exported several times.
A very few have embedded text or "graphics"/equations which might
have some unicode characters but it's not a big issue for me in practice.
I was more just thinking of the kinds of scenario where big strings
might become a problem if suddenly consuming 4x the storage
you expect.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Tim Chase
On 2021-05-26 18:43, Alan Gauld via Python-list wrote:
> On 26/05/2021 14:09, Tim Chase wrote:
>>> If so, doesn't that introduce a pretty big storage overhead for
>>> large strings?  
>> 
>> Yes.  Though such large strings tend to be more rare, largely
>> because they become unwieldy for other reasons.  
> 
> I do have some scripts that work on large strings - mainly produced
> by reading an entire text file into a string using file.read().
> Some of these are several MB long so potentially now 4x bigger than
> I thought. But you are right, even a 100MB string should still be
> OK on a modern PC with 8GB+ RAM!...

If you don't decode it upon reading it in, it should still be 100MB
because it's a stream of encoded bytes.  It would only 2x or 4x in
size if you decoded that (either as a parameter of how you opened it,
or if you later took that string and decoded it explicitly, though
now you have the original 100MB byte-string **plus** the 100/200/400MB
decoded unicode string).
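
The 1x/2x/4x figures can be checked with a rough sketch like the following.
It assumes CPython 3.3 or later, and the helper name bytes_per_char is made
up; it estimates per-character storage by comparing two strings of
different lengths so that the fixed object header cancels out.

```python
import sys


def bytes_per_char(ch, n=10_000):
    """Estimate per-character storage of a string made only of `ch`."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


widths = {
    'ascii': bytes_per_char('a'),
    'latin1': bytes_per_char('é'),
    'bmp': bytes_per_char('€'),
    'astral': bytes_per_char('\U0001F600'),
}

raw = ('a' * 999 + '€').encode('utf-8')   # what reading the file as bytes gives
decoded = raw.decode('utf-8')             # a str, widened by the one '€'
```

The bytes object stays at its encoded length, while the decoded string is
stored at the width needed by its largest code point.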

You don't specify what you then do with this humongous string, but
for most of my large files like this, I end up iterating over them
piecewise rather than f.read()'ing them all in at once. Or even if
the whole file does end up in memory, it's usually chunked and split
into useful pieces.  That could mean that each line is its own
string, almost all of which are one-byte-per-char with a couple
strings at sporadic positions in the list-of-strings where they are
2/4 bytes per char.

-tkc




Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Alan Gauld via Python-list
On 26/05/2021 14:09, Tim Chase wrote:

>> If so, doesn't that introduce a pretty big storage overhead for
>> large strings?
> 
> Yes.  Though such large strings tend to be more rare, largely because
> they become unwieldy for other reasons.

I do have some scripts that work on large strings - mainly produced by
reading an entire text file into a string using file.read(). Some of
these are several MB long so potentially now 4x bigger than I thought.
But you are right, even a 100MB string should still be OK on a
modern PC with 8GB+ RAM!...

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Terry Reedy

On 5/26/2021 12:07 PM, Chris Angelico wrote:

On Thu, May 27, 2021 at 1:59 AM Jon Ribbens via Python-list
 wrote:


On 2021-05-26, Alan Gauld  wrote:

On 25/05/2021 23:23, Terry Reedy wrote:

In CPython's Flexible String Representation all characters in a string
are stored with the same number of bytes, depending on the largest
codepoint.


I'm learning lots of new things in this thread!

Does that mean that if I give Python a UTF8 string that is mostly single
byte characters but contains one 4-byte character that Python will store
the string as all 4-byte characters?


Note that while unix uses utf-8, Windows uses utf-16.


If so, doesn't that introduce a pretty big storage overhead for
large strings?


Memory is cheap ;-)



This is true, but sometimes memory translates into time - either
direction. When the Flexible String Representation came in, it was
actually an alternative to using four bytes per character on ALL
strings (not just those that contain non-BMP characters),


Except on Windows, where CPython used 2 bytes/char + surrogates for 
non-BMP char.  This meant that indexing did not quite work on Windows 
and that applications that allowed astral chars and wanted to work on 
all systems had to have separate code for Windows and unix-based systems.



and it
actually improved performance quite notably, despite some additional
complications.


And it made CPython text manipulation code work on all CPython systems.


Performance optimization is a funny science :)



--
Terry Jan Reedy



Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Chris Angelico
On Thu, May 27, 2021 at 1:59 AM Jon Ribbens via Python-list
 wrote:
>
> On 2021-05-26, Alan Gauld  wrote:
> > On 25/05/2021 23:23, Terry Reedy wrote:
> >> In CPython's Flexible String Representation all characters in a string
> >> are stored with the same number of bytes, depending on the largest
> >> codepoint.
> >
> > I'm learning lots of new things in this thread!
> >
> > Does that mean that if I give Python a UTF8 string that is mostly single
> > byte characters but contains one 4-byte character that Python will store
> > the string as all 4-byte characters?
> >
> > If so, doesn't that introduce a pretty big storage overhead for
> > large strings?
>
> Memory is cheap ;-)
>

This is true, but sometimes memory translates into time - either
direction. When the Flexible String Representation came in, it was
actually an alternative to using four bytes per character on ALL
strings (not just those that contain non-BMP characters), and it
actually improved performance quite notably, despite some additional
complications.

Performance optimization is a funny science :)

ChrisA


Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Jon Ribbens via Python-list
On 2021-05-26, Alan Gauld  wrote:
> On 25/05/2021 23:23, Terry Reedy wrote:
>> In CPython's Flexible String Representation all characters in a string 
>> are stored with the same number of bytes, depending on the largest 
>> codepoint.
>
> I'm learning lots of new things in this thread!
>
> Does that mean that if I give Python a UTF8 string that is mostly single
> byte characters but contains one 4-byte character that Python will store
> the string as all 4-byte characters?
>
> If so, doesn't that introduce a pretty big storage overhead for
> large strings?

Memory is cheap ;-)

> I confess I had just assumed the unicode strings were stored
> in native unicode UTF8 format.

If you do that then indexing and slicing strings becomes very slow.


exit() builtin, was Re: imaplib: is this really so unwieldy?

2021-05-26 Thread Peter Otten

On 26/05/2021 01:02, Cameron Simpson wrote:

On 25May2021 15:53, Dennis Lee Bieber  wrote:

On Tue, 25 May 2021 19:21:39 +0200, hw  declaimed the
following:

Oh ok, it seemed to be fine.  Would it be the right way to do it with
sys.exit()?  Having to import another library just to end a program
might not be ideal.


I've never had to use sys. for exit...

C:\Users\Wulfraed>python
Python ActivePython 3.8.2 (ActiveState Software Inc.) based on
on win32
Type "help", "copyright", "credits" or "license" for more information.

exit()




I have learned a new thing today.

Regardless, hw didn't call it, just named it :-)


exit() is inserted into the built-ins by site.py. This means it may not 
be available:


PS D:\> py -c "exit('bye ')"
bye
PS D:\> py -S -c "exit('bye ')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'exit' is not defined

I have no idea if this is of any practical relevance...
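
For anyone who wants to reproduce the difference from a script rather than a shell, here is a small sketch. It assumes a CPython interpreter reachable through sys.executable; the helper name run_snippet is made up for illustration.

```python
import subprocess
import sys

def run_snippet(*flags):
    """Run a one-line program that calls the site-provided exit()."""
    return subprocess.run(
        [sys.executable, *flags, "-c", "exit(0)"],
        capture_output=True,
        text=True,
    )

with_site = run_snippet()        # normal startup: site.py installs exit()
without_site = run_snippet("-S") # -S skips the site module

print(with_site.returncode)
print("NameError" in without_site.stderr)
```

The -S run should fail with a NameError, which is the practical argument for sys.exit() or "raise SystemExit" in real programs.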



Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Tim Chase
On 2021-05-26 08:18, Alan Gauld via Python-list wrote:
> Does that mean that if I give Python a UTF8 string that is mostly
> single byte characters but contains one 4-byte character that
> Python will store the string as all 4-byte characters?

As best I understand it, yes:  the cost of each "character" in a
string is the same for the entire string, so even one lone 4-byte
character in an otherwise 1-byte-character string is enough to push
the whole string to 4-byte characters.  Doesn't affect other strings
though (so if you had a pure 7-bit string and a unicode string, the
former would still be 1-byte-per-char…it's not a global aspect)

If you encode these to a UTF8 byte-string, you'll get the space
savings you seek, but at the cost of sensible O(1) indexing.

Both are a trade-off, and if your data consists mostly of 7-bit ASCII
characters, or lots of small strings, the overhead is less pronounced
than if you have one single large blob of text as a string.

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?

Yes.  Though such large strings tend to be more rare, largely because
they become unwieldy for other reasons.

-tkc




Re: string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Chris Angelico
On Wed, May 26, 2021 at 10:04 PM Alan Gauld via Python-list
 wrote:
>
> On 25/05/2021 23:23, Terry Reedy wrote:
>
> > In CPython's Flexible String Representation all characters in a string
> > are stored with the same number of bytes, depending on the largest
> > codepoint.
>
> I'm learning lots of new things in this thread!
>
> Does that mean that if I give Python a UTF8 string that is mostly single
> byte characters but contains one 4-byte character that Python will store
> the string as all 4-byte characters?

Nitpick: It won't be "a UTF-8 string"; it will be "a Unicode string".
UTF-8 is a scheme for representing Unicode as a series of bytes, so if
something is UTF-8, it'll be like b'Stra\xc3\x9fe' (with two bytes
representing one non-ASCII character), whereas the corresponding
Unicode string is 'Stra\xdfe' with a single character. Or, if it were
beyond the first 256 characters, '\u2026' is an ellipsis,
b'\xe2\x80\xa6' is a UTF-8 representation of that same character. And
if it's beyond the BMP, then '\U0001F921' is one of the few non-ASCII
characters that you can legitimately write off as a "funny character",
and b'\xf0\x9f\xa4\xa1' is the UTF-8 byte sequence that would carry
that.
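
The three examples above can be checked directly. This is only a sketch using the literals from this message; the variable name samples is made up.

```python
# Each pair is (Unicode string, its UTF-8 byte sequence) as quoted above.
samples = [
    ('Stra\xdfe', b'Stra\xc3\x9fe'),         # one Latin-1 range character
    ('\u2026', b'\xe2\x80\xa6'),             # ellipsis, inside the BMP
    ('\U0001F921', b'\xf0\x9f\xa4\xa1'),     # beyond the BMP
]

for text, encoded in samples:
    # encode() goes from characters to bytes, decode() goes back again
    print(len(text), len(encoded), text.encode('utf-8') == encoded)
```

The character count and the byte count differ as soon as a non-ASCII character is involved, which is the distinction between "a Unicode string" and "UTF-8".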

So. Yes, if you give Python a large ASCII string with a single non-BMP
character, the entire string *will* be stored as four-byte characters.

(Or, to nitpick against myself: CPython will do this. Other Python
implementations are free to do differently, and for instance, uPy
actually uses UTF-8 like you were predicting. For the rest of this
post, when I say "Python", I actually mean "CPython 3.3 or later".)

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?
>
> >
> >  >>> sys.getsizeof('\U0001')
> > 80
> >  >>> sys.getsizeof('\U0001'*2)
> > 84
> >  >>> sys.getsizeof('a\U0001')
> > 84

Correct. Each additional character is going to cost you four bytes.
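
A rough way to measure the per-character cost, assuming CPython 3.3 or later. The helper name bytes_per_char is made up, and the fixed object header is cancelled out by taking a difference rather than relying on exact sizes.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing ch*2n with ch*n."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

print(bytes_per_char('a'))           # ASCII
print(bytes_per_char('\u2026'))      # BMP, above U+00FF
print(bytes_per_char('\U0001F921'))  # beyond the BMP

# One astral character widens an otherwise ASCII string.
narrow = sys.getsizeof('a' * 1000)
wide = sys.getsizeof('a' * 999 + '\U0001F921')
print(narrow, wide)
```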

> Which is what this seems to be saying.
>
> I confess I had just assumed the unicode strings were stored
> in native unicode UTF8 format.
>

UTF-8 isn't native any more than any other encoding. It's a good
compact format for transmission, but it's quite inefficient for
manipulation. Python opts to spend some memory in order to improve
time, because that's usually the correct tradeoff to make - it means
that indexing in a large string is fast, slicing a large string is
fast, etc, etc, etc.

Also, the truth is that, *in practice*, very few strings will pay this
sort of penalty. If you have a whole lot of (say) Chinese text,
there's going to be a small proportion of ASCII text, but most of the
text is going to be wider characters. Working with most European
languages will require the use of the BMP (which means 16-bit text),
but not anything beyond. And if someone's going to use one emoji from
the supplemental planes (which would require 32-bit text), it's fairly
likely that they'll use multiple.

And if you look at all strings in the Python interpreter, the vast
majority of them will be ASCII-only, getting optimized all the way
down to a single byte. Remember, every module-level variable is stored
in that module's dictionary, keyed by its name - and *most* variable
names in Python are ASCII.

So while it's true that, in theory, a single wide character can cost
you a lot of memory... in practice, this is still a lot more compact,
overall, than storing all strings in UCS-2.

ChrisA


string storage [was: Re: imaplib: is this really so unwieldy?]

2021-05-26 Thread Alan Gauld via Python-list
On 25/05/2021 23:23, Terry Reedy wrote:

> In CPython's Flexible String Representation all characters in a string 
> are stored with the same number of bytes, depending on the largest 
> codepoint.

I'm learning lots of new things in this thread!

Does that mean that if I give Python a UTF8 string that is mostly single
byte characters but contains one 4-byte character that Python will store
the string as all 4-byte characters?

If so, doesn't that introduce a pretty big storage overhead for
large strings?

> 
>  >>> sys.getsizeof('\U0001')
> 80
>  >>> sys.getsizeof('\U0001'*2)
> 84
>  >>> sys.getsizeof('a\U0001')
> 84

Which is what this seems to be saying.

I confess I had just assumed the unicode strings were stored
in native unicode UTF8 format.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: imaplib: is this really so unwieldy?

2021-05-26 Thread Chris Angelico
On Wed, May 26, 2021 at 4:21 PM Grant Edwards  wrote:
>
> On 2021-05-25, Dennis Lee Bieber  wrote:
>
> >>Oh ok, it seemed to be fine.  Would it be the right way to do it with
> >>sys.exit()?  Having to import another library just to end a program
> >>might not be ideal.
> >
> >   I've never had to use sys. for exit...
>
> I would have sworn you used to have to import sys to use exit(). Am I
> misremembering?
>
> Apparently exit() and sys.exit() aren't the same, so what is the
> difference between the builtin exit and sys.exit?
>

exit() is designed to be used interactively, so, among other things,
it also has a helpful repr:

>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit

ChrisA


Re: imaplib: is this really so unwieldy?

2021-05-26 Thread Grant Edwards
On 2021-05-25, Dennis Lee Bieber  wrote:

>>Oh ok, it seemed to be fine.  Would it be the right way to do it with 
>>sys.exit()?  Having to import another library just to end a program 
>>might not be ideal.
>
>   I've never had to use sys. for exit...

I would have sworn you used to have to import sys to use exit(). Am I
misremembering?

Apparently exit() and sys.exit() aren't the same, so what is the
difference between the builtin exit and sys.exit?



Re: imaplib: is this really so unwieldy?

2021-05-26 Thread Grant Edwards
On 2021-05-25, Dennis Lee Bieber  wrote:
> On Tue, 25 May 2021 19:21:39 +0200, hw  declaimed the
> following:
>
>
>>
>>Oh ok, it seemed to be fine.  Would it be the right way to do it with 
>>sys.exit()?  Having to import another library just to end a program 
>>might not be ideal.
>
>   I've never had to use sys. for exit...
>
> C:\Users\Wulfraed>python
> Python ActivePython 3.8.2 (ActiveState Software Inc.) based on
>  on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> exit()
>
> C:\Users\Wulfraed>python

According to the docs (and various other sources), the global variable
"exit" is provided by the site module and is only for use at the
interactive prompt -- it should not be used in programs.

  https://docs.python.org/3/library/constants.html#exit

I get the impression that real programs should not assume that the
site module has been pre-loaded during startup.

--
Grant



Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Cameron Simpson
On 25May2021 19:21, hw  wrote:
>On 5/25/21 11:38 AM, Cameron Simpson wrote:
>>On 25May2021 10:23, hw  wrote:
>>>if status != 'OK':
>>>    print('Login failed')
>>>    exit
>>
>>Your "exit" won't do what you want. I expect this code to raise a
>>NameError exception here (you've not defined "exit"). That _will_ abort
>>the programme, but in a manner indicating that you've used an unknown
>>name.  You probably want:
>>
>> sys.exit(1)
>>
>>You'll need to import "sys".
>
>Oh ok, it seemed to be fine.  Would it be the right way to do it with 
>sys.exit()?  Having to import another library just to end a program 
>might not be ideal.

To end a programme early, yes. (sys.exit() actually just raises a 
particular exception, BTW.)
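
A minimal sketch of that point: both spellings raise SystemExit, so they can be caught like any other exception. The function names here are made up for illustration.

```python
import sys

def exit_code_of(func):
    """Call func and return the exit status it requests, or None if it returns."""
    try:
        func()
    except SystemExit as exc:
        return exc.code
    return None

def via_sys_exit():
    sys.exit(3)

def via_raise():
    raise SystemExit(3)   # same effect, no import needed

print(exit_code_of(via_sys_exit), exit_code_of(via_raise))
```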

I usually write a distinct main function, so in that case one can just 
"return". After all, what seems an end-of-life circumstance in a 
standalone script like yours is just an "end this function" circumstance 
when viewed as a function, and that also lets you _call_ the main 
programme from some outer thing. Wouldn't want that outer thing 
cancelled, if it exists.

My usual boilerplate for a module with a main programme looks like this:

    import sys
    ...
    def main(argv):
        ... main programme, return like any other function ...
    ... other code for the module - functions, classes etc ...
    if __name__ == '__main__':
        sys.exit(main(sys.argv))

which (a) puts main() up the top where it can be seen, (b) makes main() 
an ordinary function like any other (c) lets me just import that module 
elsewhere and (d) no globals - everything's local to main().

The __name__ boilerplate at the bottom is the magic which figures out if 
the module was imported (__name__ will be the import module name) or 
invoked from the command line like:

python -m my_module cmd-line-args...

in which case __name__ has the special value '__main__'. A historic 
mechanism which you will convince nobody to change.

You'd be surprised how useful it is to make almost any standalone 
programme a module like this - in the medium term it almost always pays 
off for me. Even just the discipline of shoving all the formerly-global 
variables in the main function brings lower-bugs benefits.

>>I've done little with IMAP. What's in msgnums here? Eg:
>> print(type(msgnums), repr(msgnums))
>>just so we all know what we're dealing with here.
>
> <class 'list'> [b'']
>
>>>message_uuids = []
>>>for number in str(msgnums)[3:-2].split():
>>
>>This is very strange. [...]
>Yes, and I don't understand it.  'print(msgnums)' prints:
>
>[b'']
>
>when there are no messages and
>
>[b'1 2 3 4 5']

Chris has addressed this. msgnums is a list of the data components of the 
IMAP response.  By going str(msgnums) you're not getting "the message 
numbers as text" you're getting what printing a list prints. Which is 
roughly Python code: the brackets and the repr() of each list member.

Notice that the example code accessed msgnums[0] - that is the first 
data component, a bytes. That you _can_ convert to a string (under 
assumptions about the encoding).

By getting the "str" form of a list, you're forced into the weird [3:-2] 
hack to trim the ends. But it is just a hack for a transcription 
mistake, not a sane parse.
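
To make the difference concrete, here is a sketch using the sample value from the thread rather than a live server, and assuming the message numbers are ASCII digits:

```python
# Sample value copied from the thread; a real search() result has this shape.
msgnums = [b'1 2 3 4 5']

as_printed = str(msgnums)            # the list's repr, brackets and b'' included
hacked = as_printed[3:-2].split()    # only works by stripping "[b'" and "']"

# Decode the first data item instead of slicing the repr.
numbers = msgnums[0].decode('ascii').split()

print(as_printed)
print(numbers, hacked == numbers)
```

Both routes produce the same list here, but only the decode reflects what the data actually is.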

>So I was guessing that it might be an array containing a single 
>string and that referring to the first element of the array turns into 
>a string with which split() can be used.  But 'print(msgnums[0].split())' 
>prints
>
>[b'1', b'2', b'3', b'4', b'5']

msgnums[0] is bytes. You can do most str things with bytes (because that 
was found to be often useful) but you get bytes back from those 
operations as you'd hope.

>so I can only guess what that's supposed to mean: maybe an array of 
>many bytes?  The documentation[1] clearly says: "The message_set 
>options to commands below is a string [...]"

But that is the parameter to the _call_: your '(UID)' parameter.

>I also need to work with message uids rather than message numbers 
>because the numbers can easily change.  There doesn't seem to be a way 
>to do that with this library in python.

By asking for UIDs you're getting uids. Do they not work in subsequent 
calls?
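
For what it's worth, the imaplib docs also list an IMAP4.uid(command, arg, ...) method that runs a command with UIDs instead of message numbers, e.g. imapsession.uid('SEARCH', None, 'ALL'). If you stay with fetch(number, '(UID)'), the parsing can be done on the bytes directly. A sketch, assuming each data item looks like b'1 (UID 4827)'; the sample values and the name extract_uid are made up:

```python
import re

# Matches the "(UID nnn)" part of one FETCH data item; works on bytes.
UID_PATTERN = re.compile(rb'\(UID (\d+)\)')

def extract_uid(fetch_item):
    """Return the UID as a str from an item such as b'1 (UID 4827)'."""
    match = UID_PATTERN.search(fetch_item)
    if match is None:
        raise ValueError('no UID in %r' % (fetch_item,))
    return match.group(1).decode('ascii')

# Hypothetical responses standing in for a real server's answers.
sample = [b'1 (UID 4827)', b'2 (UID 4830)']
uids = [extract_uid(item) for item in sample]
print(uids)
```

The UIDs come back as strings because that is what imaplib expects as arguments in later calls.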

>So it's all guesswork, and I gave up after a while and programmed what 
>I wanted in perl.  The documentation of this library sucks, and there 
>are worlds between it and the documentation for the libraries I used 
>with perl.

I think you're better off looking for another Python IMAP library. The 
imaplib was basic functionality to (a) access the protocol in basic form 
and (b) conceal the async stuff, since IMAP is an asynchronous protocol.

You can in fact subclass it to do better things. Other libraries might do 
that, or they might have written their own protocol implementations.

>That doesn't mean I don't want to understand why this is so unwieldy. 
>It's all nice and smooth in perl.

But using what library? Something out of CPAN? Those are third party 
libraries, not Perl's presupplied 

Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Cameron Simpson
On 25May2021 15:53, Dennis Lee Bieber  wrote:
>On Tue, 25 May 2021 19:21:39 +0200, hw  declaimed the
>following:
>>Oh ok, it seemed to be fine.  Would it be the right way to do it with
>>sys.exit()?  Having to import another library just to end a program
>>might not be ideal.
>
>   I've never had to use sys. for exit...
>
>C:\Users\Wulfraed>python
>Python ActivePython 3.8.2 (ActiveState Software Inc.) based on
> on win32
>Type "help", "copyright", "credits" or "license" for more information.
>>>> exit()



I have learned a new thing today.

Regardless, hw didn't call it, just named it :-)

Cheers,
Cameron Simpson 


Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Greg Ewing

On 26/05/21 5:21 am, hw wrote:

On 5/25/21 11:38 AM, Cameron Simpson wrote:

You'll need to import "sys".


Having to import another library just to end a program 
might not be ideal.


The sys module is built-in, so the import isn't really
loading anything, it's just giving you access to a
namespace.

But if you prefer, you can get the same result without
needing an import using

   raise SystemExit(1)

--
Greg


Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Terry Reedy

On 5/25/2021 1:25 PM, MRAB wrote:

On 2021-05-25 16:41, Dennis Lee Bieber wrote:


In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER 
CHARACTER


This is CPython 3.3+ specific.  Before that, it depended on the OS.  I 
believe MicroPython uses utf-8 for strings.



(I don't recall if there is a 3-byte version).


There isn't.  It would save space but cost time.


If your input bytes are all
7-bit ASCII, then they map directly to a 1-byte per character string.


If your input bytes all have the upper bit 0 and they are interpreted as 
encoding ascii characters then they map to overhead + 1 byte per char


>>> sys.getsizeof(b''.decode('ascii'))
49
>>> sys.getsizeof(b'a'.decode('ascii'))
50
>>> sys.getsizeof(11*b'a'.decode('ascii'))
60


If
they contain any 8-bit upper half character they may map into a 2-byte 
per character string.


See below.


In CPython 3.3+:

U+0000..U+00FF are stored in 1 byte.
U+0100..U+FFFF are stored in 2 bytes.
U+010000..U+10FFFF are stored in 4 bytes.


In CPython's Flexible String Representation all characters in a string 
are stored with the same number of bytes, depending on the largest 
codepoint.


>>> sys.getsizeof('\U0001')
80
>>> sys.getsizeof('\U0001'*2)
84
>>> sys.getsizeof('a\U0001')
84

Bytes in Python 3 are just a binary stream, which needs an 
encoding to produce characters.


Or any other Python object.

Using the wrong encoding (say ISO-Latin-1) when the 
data is really UTF-8 will result in garbage.


So does decoding bytes as text when the bytes encode something else,
such as an image ;-).


--
Terry Jan Reedy




Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Chris Angelico
On Wed, May 26, 2021 at 8:27 AM Grant Edwards  wrote:
>
> On 2021-05-25, MRAB  wrote:
> > On 2021-05-25 16:41, Dennis Lee Bieber wrote:
>
> >> In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER
> >> CHARACTER (I don't recall if there is a 3-byte version). If your
> >> input bytes are all 7-bit ASCII, then they map directly to a 1-byte
> >> per character string. If they contain any 8-bit upper half
> >> character they may map into a 2-byte per character string.
> >>
> > In CPython 3.3+:
> >
> > U+0000..U+00FF are stored in 1 byte.
> > U+0100..U+FFFF are stored in 2 bytes.
> > U+010000..U+10FFFF are stored in 4 bytes.
>
> Are all characters in a string stored with the same "width"? IOW, does
> the presence of one Unicode character in the range U+010000..U+10FFFF
> in a string that is otherwise all 7-bit ASCII values result in the
> entire string being stored 4-bytes per character? Or is the storage
> width variable within a single string?
>

Yes, any given string has a single width, which makes indexing fast.
The memory cost you're describing can happen, but apart from a BOM
widening an otherwise-ASCII string to 16-bit, there aren't many cases
where you'll get a single wide character in a narrow string. Usually,
if there are any wide characters, there'll be a good number of them
(for instance, text in any particular language will often have a lot
of characters from a block of characters allocated to it).

As an added benefit, keeping all characters the same width simplifies
string searching algorithms, if I'm reading the code correctly. Checks
like >>"foo" in some_string<< can widen the string "foo" to the width
of the target string and then search efficiently.

ChrisA


Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Grant Edwards
On 2021-05-25, MRAB  wrote:
> On 2021-05-25 16:41, Dennis Lee Bieber wrote:

>> In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER
>> CHARACTER (I don't recall if there is a 3-byte version). If your
>> input bytes are all 7-bit ASCII, then they map directly to a 1-byte
>> per character string. If they contain any 8-bit upper half
>> character they may map into a 2-byte per character string.
>> 
> In CPython 3.3+:
>
> U+0000..U+00FF are stored in 1 byte.
> U+0100..U+FFFF are stored in 2 bytes.
> U+010000..U+10FFFF are stored in 4 bytes.

Are all characters in a string stored with the same "width"? IOW, does
the presence of one Unicode character in the range U+010000..U+10FFFF
in a string that is otherwise all 7-bit ASCII values result in the
entire string being stored 4-bytes per character? Or is the storage
width variable within a single string?

--
Grant






Re: imaplib: is this really so unwieldy?

2021-05-25 Thread MRAB

On 2021-05-25 16:41, Dennis Lee Bieber wrote:

On Tue, 25 May 2021 10:23:41 +0200, hw  declaimed the
following:



So I'm forced to convert stuff from bytes to strings (which is weird 
because bytes are bytes) and to use regular expressions to extract the 
message-uids from what the functions return (which I shouldn't have to 
because when I'm asking a function to give me a uid, I expect it to 
return a uid).



In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER CHARACTER
(I don't recall if there is a 3-byte version). If your input bytes are all
7-bit ASCII, then they map directly to a 1-byte per character string. If
they contain any 8-bit upper half character they may map into a 2-byte per
character string.


In CPython 3.3+:

U+0000..U+00FF are stored in 1 byte.
U+0100..U+FFFF are stored in 2 bytes.
U+010000..U+10FFFF are stored in 4 bytes.


Bytes in Python 3 are just a binary stream, which needs an encoding to
produce characters. Using the wrong encoding (say ISO-Latin-1) when the data
is really UTF-8 will result in garbage.
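
A short sketch of what that garbage looks like in practice; the sample word is arbitrary:

```python
text = 'Straße'
raw = text.encode('utf-8')       # b'Stra\xc3\x9fe': the ß takes two bytes

right = raw.decode('utf-8')      # round-trips to the original text
wrong = raw.decode('latin-1')    # never fails, but yields two wrong characters

print(len(text), len(raw), len(wrong))
print(right == text, wrong == text)
```

Decoding as Latin-1 never raises an error because every byte value is a valid Latin-1 character, which is why this kind of mistake tends to go unnoticed until the text is displayed.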





Re: imaplib: is this really so unwieldy?

2021-05-25 Thread hw

On 5/25/21 11:38 AM, Cameron Simpson wrote:

On 25May2021 10:23, hw  wrote:

I'm about to do stuff with emails on an IMAP server and wrote a program
using imaplib which, so far, gets the UIDs of the messages in the
inbox:


#!/usr/bin/python


I'm going to assume you're using Python 3.


Python 3.9.5


import imaplib
import re

imapsession = imaplib.IMAP4_SSL('imap.example.com', port = 993)

status, data = imapsession.login('user', 'password')
if status != 'OK':
    print('Login failed')
    exit


Your "exit" won't do what you want. I expect this code to raise a
NameError exception here (you've not defined "exit"). That _will_ abort
the programme, but in a manner indicating that you've used an unknown
name.  You probably want:

 sys.exit(1)

You'll need to import "sys".


Oh ok, it seemed to be fine.  Would it be the right way to do it with 
sys.exit()?  Having to import another library just to end a program 
might not be ideal.



messages = imapsession.select(mailbox = 'INBOX', readonly = True)
typ, msgnums = imapsession.search(None, 'ALL')


I've done little with IMAP. What's in msgnums here? Eg:

 print(type(msgnums), repr(msgnums))

just so we all know what we're dealing with here.


<class 'list'> [b'']


message_uuids = []
for number in str(msgnums)[3:-2].split():


This is very strange. Did you see the example at the end of the module
docs, it has this example code:

 import getpass, imaplib

 M = imaplib.IMAP4()
 M.login(getpass.getuser(), getpass.getpass())
 M.select()
 typ, data = M.search(None, 'ALL')
 for num in data[0].split():
     typ, data = M.fetch(num, '(RFC822)')
     print('Message %s\n%s\n' % (num, data[0][1]))
 M.close()
 M.logout()


Yes, and I don't understand it.  'print(msgnums)' prints:

[b'']

when there are no messages and

[b'1 2 3 4 5']

So I was guessing that it might be an array containing a single string 
and that referring to the first element of the array turns into a string 
with which split() can be used.  But 'print(msgnums[0].split())' prints


[b'1', b'2', b'3', b'4', b'5']

so I can only guess what that's supposed to mean: maybe an array of many 
bytes?  The documentation[1] clearly says: "The message_set options to 
commands below is a string [...]"


I also need to work with message uids rather than message numbers 
because the numbers can easily change.  There doesn't seem to be a way 
to do that with this library in python.


So it's all guesswork, and I gave up after a while and programmed what I 
wanted in perl.  The documentation of this library sucks, and there are 
worlds between it and the documentation for the libraries I used with perl.


That doesn't mean I don't want to understand why this is so unwieldy. 
It's all nice and smooth in perl.


[1]: https://docs.python.org/3/library/imaplib.html


It is just breaking apart data[0] into strings which were separated by
whitespace in the response. And then using those same strings as keys
for the .fetch() call. That doesn't seem complex, and in fact is blind
to the format of the "message numbers" returned. It just takes what it
is handed and uses those to fetch each message.


That's not what the documentation says.


status, data = imapsession.fetch(number, '(UID)')
if status == 'OK':
match = re.match('.*\(UID (\d+)\)', str(data))

[...]

It's working (with Cyrus), but I have the feeling I'm doing it all
wrong because it seems so unwieldy.


IMAP's quite complex. Have you read RFC2060?

 https://datatracker.ietf.org/doc/html/rfc2060.html


Yes, I referred to it and it didn't become any more clear in combination 
with the documentation of the python library.



The imaplib library is probably a fairly basic wrapper for the
underlying protocol which provides methods for the basic client requests
and conceals the asynchronicity from the user for ease of (basic) use.


Skip Montanaro seems to say that the byte problem comes from the change 
from python 2 to 3 and there is a better library now: 
https://pypi.org/project/IMAPClient/


But the documentation seems even more sparse than the one for imaplib. 
Is it a general thing with python that libraries are not well documented?



Apparently the functions of imaplib return some kind of bytes while
expecting strings as arguments, like message numbers must be strings.
The documentation doesn't seem to say if message UIDs are supposed to
be integers or strings.


You can go a long way by pretending that they are opaque strings. That
they may be numeric in content can be irrelevant to you. Treat them as
strings.


That's what I ended up doing.


So I'm forced to convert stuff from bytes to strings (which is weird
because bytes are bytes)


"bytes are bytes" is tautological.


which is a good thing


You're getting bytes for a few
reasons:

- the imap protocol largely talks about octets (bytes), but says they're
   text. For this reason a lot of stuff you pass as client parameters are
   strings, because strings are 

Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Grant Edwards
On 2021-05-25, hw  wrote:

> I'm about to do stuff with emails on an IMAP server and wrote a program 
> using imaplib

My recollection of using imaplib a few years ago is that yes, it is
unwieldy, oddly low-level, and rather un-Pythonic (excuse my
presumption in declaring what is and isn't "Pythonic").

I switched to using imaplib2 and found it much easier to use. It's a
higher-level wrapper for imaplib.

I think this is the currently maintained fork:

  https://github.com/jazzband/imaplib2

I haven't actively used either for several years, so things may have
changed...

--
Grant



Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Skip Montanaro
> It's working (with Cyrus), but I have the feeling I'm doing it all wrong
> because it seems so unwieldy.

I have a program, Polly, which I
wrote to generate XKCD 936 passphrases. (I got the idea - and the name -
from Chris Angelico. See the README.) It builds its dictionary from emails
in my Gmail account which are tagged "polly" by a Gmail filter. I had put
it away for a few years, at which time it was still using Python 2. When I
came back to it, I wanted to update it to Python 3. As with so many 2-to-3
ports, the whole bytes/str problem was my stumbling block. Imaplib's API
(as you've discovered) is not the most Pythonic. I didn't spend much time
horsing around with it. Instead, I searched for higher-level packages,
eventually landing on IMAPClient.
Once I made the switch, things came together pretty quickly, due in large
part, I think, to its more sane API.

YMMV, but you're more than welcome to steal code from Polly.

Skip


Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Chris Angelico
On Tue, May 25, 2021 at 8:21 PM Cameron Simpson  wrote:
> When you go:
>
> text = str(data)
>
> that is _assuming_ a particular text encoding stored in the data. You
> really ought to specify an encoding here. If you've not specified the
> CHARSET for things, 'ascii' would be a conservative choice. The IMAP RFC
> talks about what to expect in section 4 (Data Formats). There's quite a
> lot of possible response formats and I can understand imaplib not
> getting deeply into decoding these.

Worse than that: what you actually get is the repr of the bytes. That
might happen to look a lot like an ASCII decode, but if the string
contains unprintable characters, quotes, or anything outside of the
ASCII range, it's going to represent it as an escape code.

The best way to turn bytes into text is the decode method:

data.decode("UTF-8")
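
A small sketch of the difference, using a made-up sample value and default interpreter settings (no -b flag):

```python
data = b'caf\xc3\xa9'            # UTF-8 bytes for a four-character word

as_repr = str(data)              # the repr: b'' wrapper plus escape codes
as_text = data.decode("UTF-8")   # the actual text

print(as_repr, len(as_repr))
print(as_text, len(as_text))
```

str() keeps the b'' wrapper and the backslash escapes as literal characters, so any later regex or split is working on Python source notation rather than on the text itself.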

ChrisA


Re: imaplib: is this really so unwieldy?

2021-05-25 Thread Cameron Simpson
On 25May2021 10:23, hw  wrote:
>I'm about to do stuff with emails on an IMAP server and wrote a program 
>using imaplib which, so far, gets the UIDs of the messages in the 
>inbox:
>
>
>#!/usr/bin/python

I'm going to assume you're using Python 3.

>import imaplib
>import re
>
>imapsession = imaplib.IMAP4_SSL('imap.example.com', port = 993)
>
>status, data = imapsession.login('user', 'password')
>if status != 'OK':
>    print('Login failed')
>    exit

Your "exit" won't do what you want. A bare "exit" is just an expression
naming the exit object that the site module normally provides; it is
never called, so nothing happens and the programme carries on after a
failed login. (Run with "python -S" and it would raise a NameError
instead.)  You probably want:

sys.exit(1)

You'll need to import "sys".
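
As a rough sketch of that exit path (the helper name require_ok is
hypothetical, just something to hang the example on):

```python
import sys

def require_ok(status, what):
    """Exit with status 1 unless an imaplib-style status string is 'OK'."""
    if status != 'OK':
        print('%s failed' % what, file=sys.stderr)
        sys.exit(1)   # raises SystemExit, so nothing after this line runs
```

You would then call something like require_ok(status, 'Login') right
after the login() call.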

>messages = imapsession.select(mailbox = 'INBOX', readonly = True)
>typ, msgnums = imapsession.search(None, 'ALL')

I've done little with IMAP. What's in msgnums here? Eg:

print(type(msgnums), repr(msgnums))

just so we all know what we're dealing with here.

>message_uuids = []
>for number in str(msgnums)[3:-2].split():

This is very strange. Did you see the example at the end of the module 
docs, it has this example code:

import getpass, imaplib

M = imaplib.IMAP4()
M.login(getpass.getuser(), getpass.getpass())
M.select()
typ, data = M.search(None, 'ALL')
for num in data[0].split():
    typ, data = M.fetch(num, '(RFC822)')
    print('Message %s\n%s\n' % (num, data[0][1]))
M.close()
M.logout()

It is just breaking apart data[0] into strings which were separated by 
whitespace in the response. And then using those same strings as keys 
for the .fetch() call. That doesn't seem complex, and in fact is blind
to the format of the "message numbers" returned. It just takes what it 
is handed and uses those to fetch each message.
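
For comparison, a sketch with a stand-in value. I am assuming here that
search() hands back a one-element list of bytes, which is what the docs
example implies; the numbers themselves are made up:

```python
# Stand-in for the second item returned by search(); made up for illustration.
msgnums = [b'1 2 3']

# The approach in your loop: slice the repr of the whole list, "[b'1 2 3']".
sliced = str(msgnums)[3:-2].split()

# The approach in the docs example: split the bytes themselves.
split = msgnums[0].split()

print(sliced)   # ['1', '2', '3']
print(split)    # [b'1', b'2', b'3']
```

The second form does not depend on how Python happens to print a list
of bytes, which is why it reads as less fragile.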

>    status, data = imapsession.fetch(number, '(UID)')
>    if status == 'OK':
>        match = re.match('.*\(UID (\d+)\)', str(data))
[...]
>It's working (with Cyrus), but I have the feeling I'm doing it all 
>wrong because it seems so unwieldy.

IMAP's quite complex. Have you read RFC2060?

https://datatracker.ietf.org/doc/html/rfc2060.html

The imaplib library is probably a fairly basic wrapper for the 
underlying protocol which provides methods for the basic client requests 
and conceals the asynchronicity from the user for ease of (basic) use.

>Apparently the functions of imaplib return some kind of bytes while 
>expecting strings as arguments, like message numbers must be strings.  
>The documentation doesn't seem to say if message UIDs are supposed to 
>be integers or strings.

You can go a long way by pretending that they are opaque strings. That 
they may be numeric in content can be irrelevant to you. Treat them as
strings.

>So I'm forced to convert stuff from bytes to strings (which is weird 
>because bytes are bytes)

"bytes are bytes" is tautological. You're getting bytes for a few 
reasons:

- the imap protocol largely talks about octets (bytes), but says they're
  text. For this reason a lot of stuff you pass as client parameters are
  strings, because strings are text.

- text may be encoded as bytes in many ways, and without knowing the
  encoding, you can't extract text (strings) from bytes

- the imaplib library may date from Python 2, where the str type was
  essentially a byte sequence. In Python 3 a str is a sequence of
  Unicode code points, and you translate to/from bytes if you need to
  work with bytes.

Anyway, the IMAP responses are bytes containing text. You get a lot of
bytes.

When you go:

text = str(data)

that is _assuming_ a particular text encoding stored in the data. You 
really ought to specify an encoding here. If you've not specified the 
CHARSET for things, 'ascii' would be a conservative choice. The IMAP RFC 
talks about what to expect in section 4 (Data Formats). There's quite a 
lot of possible response formats and I can understand imaplib not 
getting deeply into decoding these.
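
One way to spell the encoding out: str() accepts it as a second
argument, and data.decode() does the same job. A sketch with made-up
sample bytes:

```python
# Made-up sample bytes; a real response will look different.
data = b'1 (UID 42)'

# Both forms decode the bytes using the named encoding.
text = str(data, 'ascii')
same = data.decode('ascii')

# A strict 'ascii' decode refuses any byte outside the 7-bit range.
try:
    str(b'caf\xc3\xa9', 'ascii')
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Being conservative with 'ascii' means you find out immediately when a
server sends something you did not expect, instead of silently getting
odd text.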

>and to use regular expressions to extract the message-uids from what 
>the functions return (which I shouldn't have to because when I'm asking 
>a function to give me a uid, I expect it to return a uid).

No, you're asking the IMAP _protocol_ to return you UIDs. The module 
itself doesn't parse what you ask for in the fetch results, and 
therefore it can't decode the response (data bytes) into some higher 
level thing (such as UIDs in your case, but you can ask for all sorts of 
weird stuff with IMAP).

So having passed '(UID)' to the FETCH request, you now need to parse
the response.
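
A minimal sketch of that parsing, working on the bytes directly instead
of going through str(). The helper name extract_uids and the sample
response are assumptions for illustration, and real responses may carry
more items than this handles:

```python
import re

# Matches the "(UID n)" part of a FETCH response line, on bytes directly.
UID_RE = re.compile(rb'\(UID (\d+)\)')

def extract_uids(fetch_data):
    """Return the UIDs found in the data part of a fetch(..., '(UID)') result."""
    uids = []
    for item in fetch_data:
        # Only plain bytes items are inspected; anything else is skipped.
        if isinstance(item, bytes):
            match = UID_RE.search(item)
            if match:
                uids.append(match.group(1).decode('ascii'))
    return uids

# Made-up sample in the shape a '(UID)' fetch is assumed to return.
print(extract_uids([b'1 (UID 4827)']))   # ['4827']
```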

>This is so totally awkward and unwieldy and involves so much overhead
>that I must be doing this wrong.  But am I?  How would I do this right?

Well, you _could_ get immersed in the nitty gritty of the IMAP protocol 
and the imaplib module, _or_ you could see if someone else has done some 
work to make this easier by writing a higher level library. A search at 
pypi.org for "imap" found a lot of stuff. 
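
If you would rather stay with imaplib, its uid() method can issue the
UID form of a command, so a UID SEARCH gives you the UIDs without a
per-message FETCH. A sketch only: list_uids is a hypothetical helper,
and I am assuming the data comes back in the same shape as search():

```python
def list_uids(session):
    """Return all message UIDs in the selected mailbox as strings.

    `session` is expected to be a logged-in imaplib.IMAP4 (or IMAP4_SSL)
    object with a mailbox already selected.
    """
    typ, data = session.uid('SEARCH', None, 'ALL')
    if typ != 'OK':
        raise RuntimeError('UID SEARCH failed: %r' % (data,))
    # data[0] is assumed to be whitespace-separated UIDs as bytes.
    return [uid.decode('ascii') for uid in data[0].split()]
```

With your session that would be list_uids(imapsession), called after
the select().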

imaplib: is this really so unwieldy?

2021-05-25 Thread hw



Hi,

I'm about to do stuff with emails on an IMAP server and wrote a program 
using imaplib which, so far, gets the UIDs of the messages in the inbox:



#!/usr/bin/python

import imaplib
import re

imapsession = imaplib.IMAP4_SSL('imap.example.com', port = 993)

status, data = imapsession.login('user', 'password')
if status != 'OK':
    print('Login failed')
    exit

messages = imapsession.select(mailbox = 'INBOX', readonly = True)
typ, msgnums = imapsession.search(None, 'ALL')
message_uuids = []
for number in str(msgnums)[3:-2].split():
    status, data = imapsession.fetch(number, '(UID)')
    if status == 'OK':
        match = re.match('.*\(UID (\d+)\)', str(data))
        message_uuids.append(match.group(1))
for uid in message_uuids:
    print('UID %5s' % uid)

imapsession.close()
imapsession.logout()


It's working (with Cyrus), but I have the feeling I'm doing it all wrong 
because it seems so unwieldy.  Apparently the functions of imaplib 
return some kind of bytes while expecting strings as arguments, like 
message numbers must be strings.  The documentation doesn't seem to say 
if message UIDs are supposed to be integers or strings.


So I'm forced to convert stuff from bytes to strings (which is weird 
because bytes are bytes) and to use regular expressions to extract the 
message-uids from what the functions return (which I shouldn't have to 
because when I'm asking a function to give me a uid, I expect it to 
return a uid).


This is so totally awkward and unwieldy and involves so much overhead that
I must be doing this wrong.  But am I?  How would I do this right?
