Re: How can I create customized classes that have similar properties as 'str'?

2007-12-06 Thread samwyse
On Nov 24, 6:38 pm, Hrvoje Niksic [EMAIL PROTECTED] wrote:
 samwyse [EMAIL PROTECTED] writes:
  create a hash that maps your keys to themselves, then use the values
  of that hash as your keys.

 The atom function you describe already exists under the name
 intern.

D'oh!  That's what I get for not memorizing "Non-essential Built-in
Functions".

In my defense, however, my function will work with anything that can
be used as a dictionary key (strings, tuples, frozen sets, etc), not
just character strings; thus we return to the original:

>>> a=(1,2,3)
>>> b=(1,1+1,1+1+1)
>>> a == b
True
>>> a is b
False
>>> atom(a) is atom(b)
True
>>> intern(a) is intern(b)
Traceback (most recent call last):
  File "<pyshell#33>", line 1, in ?
    intern(a) is intern(b)
TypeError: intern() argument 1 must be string, not tuple
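For reference, the atom() helper under discussion can be written for any
hashable key in a couple of lines.  This is a sketch in modern Python, not
samwyse's exact code:

```python
_store = {}

def atom(obj):
    """Return one canonical instance per distinct hashable value."""
    # setdefault stores obj as its own value on first sight,
    # and returns the first-seen instance ever after.
    return _store.setdefault(obj, obj)

a = (1, 2, 3)
b = tuple([1, 1 + 1, 1 + 1 + 1])  # built at runtime, so a fresh object
assert a == b
assert atom(a) is atom(b)  # both collapse to one shared tuple
```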
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How can I create customized classes that have similar properties as 'str'?

2007-12-03 Thread zooko
On Nov 24, 4:44 am, Licheng Fang [EMAIL PROTECTED] wrote:

 Yes, millions. In my natural language processing tasks, I almost
 always need to define patterns, identify their occurrences in a huge
 data, and count them. Say, I have a big text file, consisting of
 millions of words, and I want to count the frequency of trigrams:

I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory.  Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled.  You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.
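For anyone wanting to reproduce that kind of measurement, here is a rough
sketch.  It is CPython-specific, and sys.getsizeof counts object sizes only
(not allocator overhead), so your numbers will differ:

```python
import sys

d = {}
for i in range(1000):
    d['k%019d' % i] = 'v%07d' % i   # 20-byte keys, 8-byte values

# Per-entry cost of the dict plus its key and value objects
# (excludes malloc overhead and anything shared between entries).
objects = sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())
per_entry = (sys.getsizeof(d) + objects) / len(d)
print('about %d bytes per entry' % per_entry)
```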

Try tokenizing your trigrams by defining a dict from three unigrams
to a sequentially allocated integer trigram id (also called a
trigram token), and a reverse dict which goes from a trigram id to
the three unigrams.  Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects.  If you do this multiple times, you get multiple
references to the same integer object for the trigram id.
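A sketch of that tokenizing scheme (the names here are illustrative, not
from the original code): a forward dict assigns each new trigram the next
integer id, and a list serves as the reverse mapping.

```python
trigram_ids = {}   # (u1, u2, u3) -> small integer token
id_trigrams = []   # token -> (u1, u2, u3), the reverse mapping

def trigram_id(u1, u2, u3):
    """Map three unigrams to a compact, sequentially allocated integer."""
    key = (u1, u2, u3)
    tid = trigram_ids.get(key)
    if tid is None:
        tid = len(id_trigrams)       # next unused id
        trigram_ids[key] = tid
        id_trigrams.append(key)
    return tid

# Counting then happens over small ints, not over tuples of objects.
counts = {}
for tri in [('a', 'b', 'c'), ('b', 'c', 'd'), ('a', 'b', 'c')]:
    tid = trigram_id(*tri)
    counts[tid] = counts.get(tid, 0) + 1
```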

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure.  In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign to each
unigram an integer, and assuming that you have fewer than two million
unigrams you can pack three unigrams into a 64-bit integer...  Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees.  (At the time, PyJudy didn't exist.)
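A sketch of the packing step (assuming unigram ids below 2**21, i.e. about
two million, so three 21-bit ids fit in 63 bits):

```python
MASK = (1 << 21) - 1  # low 21 bits

def pack(u1, u2, u3):
    """Pack three 21-bit unigram ids into one integer."""
    assert max(u1, u2, u3) < 1 << 21
    return (u1 << 42) | (u2 << 21) | u3

def unpack(t):
    """Recover the three unigram ids from a packed trigram."""
    return (t >> 42) & MASK, (t >> 21) & MASK, t & MASK

t = pack(12, 34567, (1 << 21) - 1)
assert unpack(t) == (12, 34567, (1 << 21) - 1)
```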

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-27 Thread Licheng Fang
On Nov 27, 10:45 am, Steven D'Aprano
[EMAIL PROTECTED] wrote:
 On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
  I mentioned trigram counting as an illustrative case. In fact, you'll
  often need to define patterns more complex than that, and tens of
  megabytes of text may generate millions of them, and I've observed they
  quickly  ate up the 8G memory of a workstation in a few minutes.
  Manipulating these patterns can be tricky, you can easily waste a lot of
  memory without taking extra care. I just thought if I define my pattern
  class with this 'atom' property, coding efforts could be easier later.

 I'm just not getting the same results as you when I try this. I'm finding
 that with no extra effort at all, it just works.

 The size of your corpus is not important. Neither is the complexity of
 how you generate the patterns. What's important is the number of patterns
 you produce, and millions isn't that huge a number, even without
 interning the strings.

 Obviously I'm not running your code, but I can build a dict with millions
 of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
 memory and not run into any serious problems.

 I've just processed roughly 240MB of random emails, generating n-grams up
 to length 5. The emails include binary attachments and HTML etc., so I'm
 getting lots of patterns that don't normally exist in natural languages
 (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has
 only 1GB, and that's being shared with about a dozen other apps (including
 such memory hogs as Firefox).

 Results? 64939962 patterns found, of which 17031467 are unique. There's
 paging, yes, and my PC runs a little slow when I try to switch from one
 running application to another, but nothing unusable. Opening a dozen
 YouTube videos at once impacts performance worse.

 I can't think what you're doing to use up 8GB of RAM for merely
 millions of strings, even if you are keeping two, three, ten redundant
 copies. Assuming an average length of twenty bytes per pattern (you said
 trigrams, but I'd rather over-estimate than under), and even assuming
 that only half the 8GB are available to Python, you should be able to
 store something of the order of one hundred million patterns easily:

My task is identifying sequences of tokens (phrases) that are possible
translations of each other from a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with n, but approximately with
n^2. (n being the sentence length.) My first version of the program
consumes almost twice as much memory as the current one because I
discovered in doing different statistics I was regenerating the
patterns, and the counting dictionaries ended up with duplicated
pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
could restrict the pattern class to never generate duplicate instances?
Then I could avoid such subtle but significant bugs.


 4 bytes for a pointer plus 20 bytes for the string = 24 bytes

 4*1024**3 / 24 = 178,956,970

 (This is a ballpark figure. Real Python strings will have additional
 overhead.)




 If you're running into problems with merely millions of patterns, then
 you're doing worse by probably two orders of magnitude.

 I don't think that the problem lies where you think it does. If you have
 a dict with millions of keys, it doesn't matter how many times each
 pattern exists in the corpus because the key only exists *once* in the
 dict. Duplicate the dict, or generate it from scratch even, and at worse
 you double the memory used by the keys. And probably not even that.

 The only thing I can think of that might explain why you're using so much
 memory is if you are generating *all* the patterns up front, say in a
 list, before adding them to the dict:

 # Generate one massive list of patterns containing many duplicates
 patterns = make_patterns(text)
 # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
 d = {}
 for pattern in patterns:
     d[pattern] = d.get(pattern, 0) + 1


No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice? Is there a way to do that when the dictionary values
are of a primitive type?
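For what it's worth, the standard library's collections.defaultdict tidies
up that counting loop.  It does not remove the second dict operation
("counts[p] += 1" is still a lookup plus a store), but CPython strings cache
their hash, so the hash itself is computed only once per object:

```python
from collections import defaultdict

counts = defaultdict(int)      # missing keys start at 0
for pattern in ['fre', 'req', 'equ', 'fre']:
    counts[pattern] += 1       # no explicit .get(...) needed
```

collections.Counter(iterable) does the same bookkeeping in a single call.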

 Notice that the real killer in the above algorithm is that you need
 enough storage, not just for the unique patterns, but for EVERY separate
 occurrence of each pattern. Nothing to do with how dicts operate, and
 everything to do with the algorithm you (hypothetically) are using.

 If that's what you're doing, then no wonder you're running out of memory.
 With 200MB of text, you have 209715198 trigrams in your list. The
 pointers alone will take almost a gigabyte, assuming 32-bit pointers.

 If this is your approach, interning the strings won't save you. You
 almost certainly should change to a lazy approach, and use a 

Re: How can I create customized classes that have similar properties as 'str'?

2007-11-27 Thread Chris Mellon
On Nov 27, 2007 7:16 AM, Licheng Fang [EMAIL PROTECTED] wrote:
 On Nov 27, 10:45 am, Steven D'Aprano

 [EMAIL PROTECTED] wrote:
  On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
   I mentioned trigram counting as an illustrative case. In fact, you'll
   often need to define patterns more complex than that, and tens of
   megabytes of text may generate millions of them, and I've observed they
   quickly  ate up the 8G memory of a workstation in a few minutes.
   Manipulating these patterns can be tricky, you can easily waste a lot of
   memory without taking extra care. I just thought if I define my pattern
   class with this 'atom' property, coding efforts could be easier later.
 
  I'm just not getting the same results as you when I try this. I'm finding
  that with no extra effort at all, it just works.
 
  The size of your corpus is not important. Neither is the complexity of
  how you generate the patterns. What's important is the number of patterns
  you produce, and millions isn't that huge a number, even without
  interning the strings.
 
  Obviously I'm not running your code, but I can build a dict with millions
  of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
  memory and not run into any serious problems.
 
  I've just processed roughly 240MB of random emails, generating n-grams up
  to length 5. The emails include binary attachments and HTML etc., so I'm
  getting lots of patterns that don't normally exist in natural languages
  (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has
  only 1GB, and that's being shared with about a dozen other apps (including
  such memory hogs as Firefox).
 
  Results? 64939962 patterns found, of which 17031467 are unique. There's
  paging, yes, and my PC runs a little slow when I try to switch from one
  running application to another, but nothing unusable. Opening a dozen
  YouTube videos at once impacts performance worse.
 
  I can't think what you're doing to use up 8GB of RAM for merely
  millions of strings, even if you are keeping two, three, ten redundant
  copies. Assuming an average length of twenty bytes per pattern (you said
  trigrams, but I'd rather over-estimate than under), and even assuming
  that only half the 8GB are available to Python, you should be able to
  store something of the order of one hundred million patterns easily:

 My task is identifying sequences of tokens (phrases) that are possible
 translations of each other from a bilingual corpus. I need to check all
 the subsequences of a sentence to get the possible phrase pairs. This
 makes the problem different from n-gram counting in that the number of
 possible phrases doesn't grow linearly with n, but approximately with
 n^2. (n being the sentence length.) My first version of the program
 consumes almost twice as much memory as the current one because I
 discovered in doing different statistics I was regenerating the
 patterns, and the counting dictionaries ended up with duplicated
 pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
 could restrict the pattern class to never generate duplicate instances?
 Then I could avoid such subtle but significant bugs.


Implement __hash__ and __eq__ on your pattern class. If the same
pattern compares equal and hashes the same then it will be a matching
key as far as the dict is concerned and will only be stored once.
This is probably cheaper than explicit interning anyway (you don't
need to search an intern table).
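A sketch of what such a pattern class might look like; the cached hash
mimics what str does internally (names are illustrative):

```python
class Pattern:
    """Immutable token sequence usable as a dict key."""
    __slots__ = ('tokens', '_hash')

    def __init__(self, tokens):
        self.tokens = tuple(tokens)
        self._hash = hash(self.tokens)   # compute once, like str does

    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens

    def __hash__(self):
        return self._hash

# Equal-but-distinct instances land on the same dict slot:
d = {}
d[Pattern([1, 2, 3])] = 1
d[Pattern([1, 2, 3])] = d.get(Pattern([1, 2, 3]), 0) + 1
assert len(d) == 1
```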


  The only thing I can think of that might explain why you're using so much
  memory is if you are generating *all* the patterns up front, say in a
  list, before adding them to the dict:
 
  # Generate one massive list of patterns containing many duplicates
  patterns = make_patterns(text)
  # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
  d = {}
  for pattern in patterns:
      d[pattern] = d.get(pattern, 0) + 1
 

 No, I wasn't doing that.
 BTW, do you think the pattern counting code can avoid hashing the
 pattern twice? Is there a way to do that when the dictionary values
 are of a primitive type?


Hashing isn't really an expensive operation. On strings it's even
cached on the object. If you implement your own __hash__ method you
can do the same, but I wouldn't bother unless you benchmark it and
discover that hashing is a bottleneck.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-26 Thread Steven D'Aprano
On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:

 I mentioned trigram counting as an illustrative case. In fact, you'll
 often need to define patterns more complex than that, and tens of
 megabytes of text may generate millions of them, and I've observed they
 quickly  ate up the 8G memory of a workstation in a few minutes.
 Manipulating these patterns can be tricky, you can easily waste a lot of
 memory without taking extra care. I just thought if I define my pattern
 class with this 'atom' property, coding efforts could be easier later.

I'm just not getting the same results as you when I try this. I'm finding 
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of 
how you generate the patterns. What's important is the number of patterns
you produce, and millions isn't that huge a number, even without 
interning the strings.

Obviously I'm not running your code, but I can build a dict with millions 
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of 
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has 
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64939962 patterns found, of which 17031467 are unique. There's 
paging, yes, and my PC runs a little slow when I try to switch from one 
running application to another, but nothing unusable. Opening a dozen 
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely 
millions of strings, even if you are keeping two, three, ten redundant 
copies. Assuming an average length of twenty bytes per pattern (you said 
trigrams, but I'd rather over-estimate than under), and even assuming 
that only half the 8GB are available to Python, you should be able to 
store something of the order of one hundred million patterns easily:

4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional 
overhead.)

If you're running into problems with merely millions of patterns, then 
you're doing worse by probably two orders of magnitude.

I don't think that the problem lies where you think it does. If you have 
a dict with millions of keys, it doesn't matter how many times each 
pattern exists in the corpus because the key only exists *once* in the 
dict. Duplicate the dict, or generate it from scratch even, and at worse 
you double the memory used by the keys. And probably not even that.

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a 
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1


Notice that the real killer in the above algorithm is that you need 
enough storage, not just for the unique patterns, but for EVERY separate 
occurrence of each pattern. Nothing to do with how dicts operate, and 
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory. 
With 200MB of text, you have 209715198 trigrams in your list. The 
pointers alone will take almost a gigabyte, assuming 32-bit pointers.

If this is your approach, interning the strings won't save you. You 
almost certainly should change to a lazy approach, and use a generator to 
make each pattern as needed, then thrown away:

def make_patterns(s, n=3):
    """Yield n-grams."""
    if len(s) <= n:
        yield s
    else:
        for i in range(len(s)-n+1):
            yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word.strip()):
            d[pattern] = d.get(pattern, 0) + 1



Obviously I haven't seen your code and don't actually know what you are 
doing. If my 1GB machine can deal with a dict of 17 million unique keys 
from a corpus of 240MB with no special memory management, your 8GB 
machine should too.



-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-25 Thread Peter Otten
Steven D'Aprano wrote:

 store = {}
 def atom(str):
     global store
     if str not in store:
         store[str] = str
     return store[str]
 
 Oh lordy, that's really made my day! That's the funniest piece of code 
 I've seen for a long time! Worthy of being submitted to the DailyWTF.

Here's a script to show atom()'s effect on memory footprint:

$ cat atom.py
import sys
data = [1]*1000
items = []
cache = {}
if "-a" in sys.argv:
def atom(obj):
try:
return cache[obj]
except KeyError:
cache[obj] = obj
return obj
else:
def atom(obj):
return obj
try:
while 1:
items.append(atom(tuple(data)))
except MemoryError:
print len(items)
$ ulimit -v 5000
$ python atom.py
226
$ python atom.py -a
185742

So if you are going to submit Sam's function make sure to bundle it with
this little demo...

Peter


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-25 Thread Licheng Fang
On Nov 25, 5:59 am, Steven D'Aprano [EMAIL PROTECTED]
cybersource.com.au wrote:
 On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
  On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
  [EMAIL PROTECTED] wrote:
  Licheng Fang wrote:
   I find myself frequently in need of classes like this for two
   reasons. First, it's efficient in memory.

  Are you using millions of objects, or MB size objects? Otherwise, this
  is no argument.

  Yes, millions.

 Oh noes!!! Not millions of words!!! That's like, oh, a few tens of
 megabytes!!!1! How will a PC with one or two gigabytes of RAM cope?

 Tens of megabytes is not a lot of data.

 If the average word size is ten characters, then one million words takes
 ten million bytes, or a little shy of ten megabytes. Even if you are
 using four-byte characters, you've got 40 MB, still a moderate amount of
 data on a modern system.

I mentioned trigram counting as an illustrative case. In fact, you'll
often need to define patterns more complex than that, and tens of
megabytes of text may generate millions of them, and I've observed
they quickly ate up the 8G memory of a workstation in a few minutes.
Manipulating these patterns can be tricky; you can easily waste a lot
of memory without taking extra care. I just thought if I define my
pattern class with this 'atom' property, coding efforts could be
easier later.


  In my natural language processing tasks, I almost always
  need to define patterns, identify their occurrences in a huge data, and
  count them. Say, I have a big text file, consisting of millions of
  words, and I want to count the frequency of trigrams:

  trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

  I can save the counts in a dict D1. Later, I may want to recount the
  trigrams, with some minor modifications, say, doing it on every other
  line of the input file, and the counts are saved in dict D2. Problem is,
  D1 and D2 have almost the same set of keys (trigrams of the text), yet
  the keys in D2 are new instances, even though these keys probably have
  already been inserted into D1. So I end up with unnecessary duplicates
  of keys. And this can be a great waste of memory with huge input data.

 All these keys will almost certainly add up to only a few hundred
 megabytes, which is a reasonable size of data but not excessive. This
 really sounds to me like a case of premature optimization. I think you
 are wasting your time solving a non-problem.

 [snip]

  Wow, I didn't know this. But exactly how does Python manage these strings? My
  interpreter gave me such results:

  >>> a = 'this'
  >>> b = 'this'
  >>> a is b
  True
  >>> a = 'this is confusing'
  >>> b = 'this is confusing'
  >>> a is b
  False

 It's an implementation detail. You shouldn't use identity testing unless
 you actually care that two names refer to the same object, not because
 you want to save a few bytes. That's poor design: it's fragile,
 complicated, and defeats the purpose of using a high-level language like
 Python.

 --
 Steven.



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-25 Thread Steven D'Aprano
On Sun, 25 Nov 2007 10:39:38 +0100, Peter Otten wrote:

 So if you are going to submit Sam's function make sure to bundle it with
 this little demo...

Well Peter, I was going to reply with a comment about not changing the 
problem domain (tuples of ints to trigrams from a text file for natural 
language processing, that is, three character alphanumeric strings), and 
that if you re-did your test with strings (as I did) you would see 
absolutely no difference. What I was going to say was "Tuples aren't 
interned. Short strings that look like identifiers are. Jumping through 
hoops to cache things which are already cached is not productive 
programming."

But then I dug a little deeper, and disassembled the code I was running, 
and discovered that I was being fooled by the Python compiler's constant-
folding, and if I took steps to defeat the optimizer, the effect I was 
seeing disappeared, and I got the same results as you.

Well. So I've learned something new: Python doesn't intern strings in the 
way I thought it did. I don't quite know *how* it decides which strings 
to intern and which ones not to, but at least I've learnt that what I 
thought was true is not true.

So I offer my apology to Samwyse, the caching code isn't as redundant and 
silly as it appears, and humbly tuck into this nice steaming plate of 
crow.

Somebody pass the salt please.



-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-25 Thread MonkeeSage
On Nov 24, 6:42 pm, Steven D'Aprano [EMAIL PROTECTED]
cybersource.com.au wrote:

 This has nothing, absolutely NOTHING, to do with memoization. Memoization
 trades off memory for time, allowing slow functions to return results
 faster at the cost of using more memory. The OP wants to save memory, not
 use more of it.

Not to beat a dead horse, but memoization can significantly minimize
memory usage, given a large data set with redundant elements (as the
OP seems to suggest [e.g., calculating the deltas of trigrams in a
natural language]). For example, if the data set has 1/3 redundant
elements, then the un-memoized version requires 1/3 more memory,
because it needs to store 1/3 more redundant copies of the element,
whereas the memoized version has only to store unique elements once.
Every instance of an element which is already in the cache requires
only the cache lookup (speed), rather than the creation of a new
object (memory). So the trade-off is actually speed for memory rather
than the other way around. Of course, it all depends on the nature of
the data; but for large, redundant data sets, memoization is
definitely a win when it comes to memory optimization.
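A toy illustration of memoization acting as deduplication, using
functools.lru_cache (an assumption about how one might realize it, not the
OP's code): the cache keeps one canonical instance per distinct value, and
every later equal value gets the first-seen object back.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def canonical(tri):
    """Memoized identity: equal tuples collapse to the first-seen one."""
    return tri

a = canonical((1, 2, 3))
b = canonical(tuple([1, 1 + 1, 1 + 1 + 1]))  # equal, freshly built
assert a == b and a is b   # one stored copy serves both
```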

Regards,
Jordan


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-25 Thread Colin J. Williams
Steven D'Aprano wrote:
 On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
 
 samwyse [EMAIL PROTECTED] writes:

 create a hash that maps your keys to themselves, then use the values of
 that hash as your keys.
 The atom function you describe already exists under the name intern.
 
 Not really. intern() works very differently, because it can tie itself to 
 the Python internals. Samwyse's atom() function doesn't, and so has no 
 purpose.
 
 
 In any case, I'm not sure that intern() actually will solve the OP's 
 problem, even assuming it is a real and not imaginary problem. According 
 to the docs, intern()'s purpose is to speed up dictionary lookups, not to 
 save memory. I suspect that if it does save memory, it will be by 
 accident.
 
From the docs: 
 http://docs.python.org/lib/non-essential-built-in-funcs.html

Yes, but it seems that buffer remains 
with Python 3000.

Colin W.
 
 intern(  string)
 Enter string in the table of ``interned'' strings and return the interned 
 string - which is string itself or a copy. Interning strings is useful to 
 gain a little performance on dictionary lookup - if the keys in a 
 dictionary are interned, and the lookup key is interned, the key 
 comparisons (after hashing) can be done by a pointer compare instead of a 
 string compare. Normally, the names used in Python programs are 
 automatically interned, and the dictionaries used to hold module, class 
 or instance attributes have interned keys. Changed in version 2.3: 
 Interned strings are not immortal (like they used to be in Python 2.2 and 
 before); you must keep a reference to the return value of intern() around 
 to benefit from it.
 
 
 Note the words "which is string itself or a copy". It would be ironic if 
 the OP uses intern to avoid having copies of strings, and ends up with 
 even more copies than if he didn't bother.
 
 I guess he'll actually need to measure his memory consumption and see 
 whether he actually has a memory problem or not, right?
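A quick check of that behaviour is easy to run; note that in Python 3 the
builtin moved to sys.intern:

```python
import sys

# Build two equal but distinct strings at runtime
# (defeats the compiler's constant folding).
a = ''.join(['this', ' is', ' confusing'])
b = ''.join(['this', ' is', ' confusing'])
assert a == b and a is not b

ia, ib = sys.intern(a), sys.intern(b)
assert ia is ib          # one shared copy after interning
```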
 
 


How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Licheng Fang
I mean, all the class instances that are equal to each other should be
reduced to a single instance, which means for instances of this
class there's no difference between "a is b" and "a == b".

Thank you.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Licheng Fang
I find myself frequently in need of classes like this for two reasons.
First, it's efficient in memory. Second, when two instances are
compared for equality only their pointers are compared. (I think
that's how Python compares 'str's.)

On Nov 24, 6:31 pm, Licheng Fang [EMAIL PROTECTED] wrote:
 I mean, all the class instances that equal to each other should be
 reduced into only one instance, which means for instances of this
 class there's no difference between a is b and a==b.

 Thank you.



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Bjoern Schliessmann
Licheng Fang wrote:
 I mean, all the class instances that equal to each other should be
 reduced into only one instance, which means for instances of this
 class there's no difference between a is b and a==b.

If you only want that if "a == b" is True also "a is b" is True,
overload the "is_" attribute of your class. Personally, I don't see
any advantage in this.

Regards,


Björn

-- 
BOFH excuse #352:

The cables are not the same length.



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Bjoern Schliessmann
Licheng Fang wrote:
 I find myself frequently in need of classes like this for two
 reasons. First, it's efficient in memory. 

Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.

BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.

 Second, when two instances are compared for equality only their
 pointers are compared. 

I state that the object management will often eat more performance
than equality testing, unless you have a huge number of equal
objects. If the latter is the case, you should rethink your program
design.

 (I think that's how Python compares 'str's. 

Generally not. In CPython, just very short strings are created only
once.

>>> a = ""
>>> b = ""
>>> a is b
True
>>> a = "  "
>>> b = "  "
>>> a is b
False
 
Regards,


Björn

-- 
BOFH excuse #430:

Mouse has out-of-cheese-error



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Licheng Fang
On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
[EMAIL PROTECTED] wrote:
 Licheng Fang wrote:
  I find myself frequently in need of classes like this for two
  reasons. First, it's efficient in memory.

 Are you using millions of objects, or MB size objects? Otherwise,
 this is no argument.

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem
is, D1 and D2 have almost the same set of keys (trigrams of the text),
yet the keys in D2 are new instances, even though these keys probably
have already been inserted into D1. So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.
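The trigrams() helper used above can be sketched in a line or two,
returning tuples so the n-grams can serve as dict keys:

```python
def trigrams(tokens):
    """All consecutive length-3 windows of a sequence, as tuples."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

assert trigrams([1, 2, 3, 4, 5]) == [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```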


 BTW, what happens if you, by some operation, make a == b, and
 afterwards change b so another object instance must be created?
 This instance management is quite a runtime overhead.


I probably need this class to be immutable.

  Second, when two instances are compared for equality only their
  pointers are compared.

 I state that the object management will often eat more performance
 than equality testing. Except you have a huge number of equal
 objects. If the latter was the case you should rethink your program
 design.


Yeah, counting is all about equal or not.

  (I think that's how Python compares 'str's.

 Generally not. In CPython, just very short strings are created only
 once.

  >>> a = ""
  >>> b = ""
  >>> a is b
 True
  >>> a = "  "
  >>> b = "  "
  >>> a is b


Wow, I didn't know this. But exactly how does Python manage these
strings? My interpreter gave me such results:

>>> a = 'this'
>>> b = 'this'
>>> a is b
True
>>> a = 'this is confusing'
>>> b = 'this is confusing'
>>> a is b
False


 False

 Regards,

 Björn

 --
 BOFH excuse #430:

 Mouse has out-of-cheese-error

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Bjoern Schliessmann
Licheng Fang wrote:
 On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-

 BTW, what happens if you, by some operation, make a == b, and
 afterwards change b so another object instance must be created?
 This instance management is quite a runtime overhead.
 
 I probably need this class to be immutable.

IMHO you don't need a change of Python, but simply a special
implementation (probably using metaclasses/singletons).
 
 Wow, I didn't know this. But exactly how Python manage these
 strings? 

I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

Regards,


Björn

-- 
BOFH excuse #165:

Backbone Scoliosis



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread samwyse
On Nov 24, 5:44 am, Licheng Fang [EMAIL PROTECTED] wrote:
 Yes, millions. In my natural language processing tasks, I almost
 always need to define patterns, identify their occurrences in a huge
 data, and count them. [...] So I end up with unnecessary
 duplicates of keys. And this can be a great waste of memory with huge
 input data.

create a hash that maps your keys to themselves, then use the values
of that hash as your keys.

>>> store = {}
>>> def atom(str):
...     global store
...     if str not in store:
...         store[str] = str
...     return store[str]

>>> a = 'this is confusing'
>>> b = 'this is confusing'
>>> a == b
True
>>> a is b
False
>>> atom(a) is atom(b)
True


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Marc 'BlackJack' Rintsch
On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:

 Licheng Fang wrote:
 On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
 
 Wow, I didn't know this. But exactly how Python manage these
 strings? 
 
 I don't know (use the source, Luke). :) Or perhaps there is a Python
 Elder here that knows?

AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Licheng Fang
On Nov 24, 9:42 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
 On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:
  Licheng Fang wrote:
  On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-

  Wow, I didn't know this. But exactly how Python manage these
  strings?

  I don't know (use the source, Luke). :) Or perhaps there is a Python
  Elder here that knows?

 AFAIK strings of length 1 and strings that would be valid Python
 identifiers are treated this way.

 Ciao,
 Marc 'BlackJack' Rintsch

Thanks. Then, is there a way to make Python treat all strings this
way, or any other kind of immutable objects?


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Bruno Desthuilliers
Licheng Fang wrote:
 I mean, all the class instances that equal to each other should be
 reduced into only one instance, which means for instances of this
 class there's no difference between a is b and a==b.

Here's a QD attempt - without any guarantee, and to be tailored to your 
needs.

_values = {}    # id(instance) -> value mapping
_instances = {} # hash(value) -> instance mapping

class Value(object):
    def __new__(cls, value):
        try:
            return _instances[hash(value)]
        except KeyError:
            instance = object.__new__(cls)
            _values[id(instance)] = value
            _instances[hash(value)] = instance
            return instance

    @apply
    def value():
        def fget(self):
            return _values[id(self)]
        def fset(self, ignore):
            raise AttributeError("%s.value is read only" % type(self))
        def fdel(self):
            raise AttributeError("%s.value is read only" % type(self))
        return property(**locals())
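One caveat with the recipe above: keying the table on `hash(value)` will conflate two distinct values whose hashes happen to collide. A variant of the same idea that keys on the value itself avoids this (a sketch in modern Python of my own, not Bruno's code):

```python
class Value:
    _instances = {}  # value -> canonical instance

    def __new__(cls, value):
        try:
            return cls._instances[value]
        except KeyError:
            instance = super().__new__(cls)
            instance._value = value
            cls._instances[value] = instance
            return instance

    @property
    def value(self):
        return self._value

v1 = Value((1, 2, 3))
v2 = Value((1, 1 + 1, 3))
assert v1 is v2  # equal inputs yield the same instance
```

Like the original, this table keeps every instance alive forever; a `weakref.WeakValueDictionary` would let unused instances be collected.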


HTH


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread samwyse
On Nov 24, 10:35 am, Licheng Fang [EMAIL PROTECTED] wrote:
 Thanks. Then, is there a way to make python treat all strings this
 way, or any other kind of immutable objects?

The word generally used is 'atom' when referring to strings that are
set up such that 'a == b' implies 'a is b'.  This is usually an
expensive process, since you don't want to do it to strings that are,
e.g., read from a file.  Yes, it could be done only for string
constants, and some languages (starting with LISP) do this, but that
isn't what you (or most people) want.  Whether you realize it or not,
you want control over the process; in your example, you don't want to
do it for the lines read from your file, just the trigrams.

The example that I gave does exactly that.  It adds a fixed amount of
storage for each string that you 'intern' (the usual name given to the
process of generating such a string).  Let's look at my code again:

>>> store = {}
>>> def atom(str):
...     global store
...     if str not in store:
...         store[str] = str
...     return store[str]

Each string passed to 'atom' already exists.  We look to see if a copy
already exists; if so, we can discard the latest instance and use that
copy henceforth.  If a copy does not exist, we save the string inside
'store'.  Since the string already exists, we're just increasing its
reference count by two (so it won't be garbage collected) and
increasing the size of 'store' by (an amortized) pair of pointers to
that same string.
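The bookkeeping above can be checked directly in CPython (a sketch; `sys.getrefcount` reports CPython's internal count, so the exact numbers are an implementation detail):

```python
import sys

store = {}
# build the string at runtime so it is a fresh, un-interned object
s = ''.join(['tri', 'gram'])
before = sys.getrefcount(s)
store[s] = s  # the dict now holds s twice: once as key, once as value
after = sys.getrefcount(s)
assert after == before + 2
```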


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread samwyse
On Nov 24, 5:44 am, Licheng Fang [EMAIL PROTECTED] wrote:
 Yes, millions. In my natural language processing tasks, I almost
 always need to define patterns, identify their occurrences in a huge
 data, and count them. Say, I have a big text file, consisting of
 millions of words, and I want to count the frequency of trigrams:

 trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

BTW, if the components of your trigrams are never larger than a byte,
then encode the tuples as integers and don't worry about pointer
comparisons.

>>> def encode(s):
...     return (ord(s[0])*256+ord(s[1]))*256+ord(s[2])

>>> def trigram(s):
...     return [encode(s[i:i+3]) for i in range(0, len(s)-2)]

>>> trigram('abcde')
[6382179, 6447972, 6513765]
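For what it's worth, that packing is easy to invert (a sketch; `decode` is my name, not samwyse's):

```python
def encode(s):
    # pack three character codes into one integer, base 256
    return (ord(s[0]) * 256 + ord(s[1])) * 256 + ord(s[2])

def decode(n):
    # unpack the three bytes back into characters
    return chr((n >> 16) & 0xFF) + chr((n >> 8) & 0xFF) + chr(n & 0xFF)

assert decode(encode('abc')) == 'abc'
```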


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Steven D'Aprano
On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:

 On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
 [EMAIL PROTECTED] wrote:
 Licheng Fang wrote:
  I find myself frequently in need of classes like this for two
  reasons. First, it's efficient in memory.

 Are you using millions of objects, or MB size objects? Otherwise, this
 is no argument.
 
 Yes, millions. 


Oh noes!!! Not millions of words!!! That's like, oh, a few tens of 
megabytes!!!1! How will a PC with one or two gigabytes of RAM cope?

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words takes 
ten million bytes, or a little shy of ten megabytes. Even if you are 
using four-byte characters, you've got 40 MB, still a moderate amount of 
data on a modern system.
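The actual per-string cost is easy to measure (a sketch; exact figures vary by Python version and build):

```python
import sys

# a str object costs its payload plus a fixed per-object header
overhead = sys.getsizeof('')          # header alone, zero characters
ten_chars = sys.getsizeof('a' * 10)   # header plus ten characters
print(overhead, ten_chars)
```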


 In my natural language processing tasks, I almost always
 need to define patterns, identify their occurrences in a huge data, and
 count them. Say, I have a big text file, consisting of millions of
 words, and I want to count the frequency of trigrams:
 
 trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
 
 I can save the counts in a dict D1. Later, I may want to recount the
 trigrams, with some minor modifications, say, doing it on every other
 line of the input file, and the counts are saved in dict D2. Problem is,
 D1 and D2 have almost the same set of keys (trigrams of the text), yet
 the keys in D2 are new instances, even though these keys probably have
 already been inserted into D1. So I end up with unnecessary duplicates
 of keys. And this can be a great waste of memory with huge input data.

All these keys will almost certainly add up to only a few hundred 
megabytes, which is a reasonable size of data but not excessive. This 
really sounds to me like a case of premature optimization. I think you 
are wasting your time solving a non-problem.



[snip]
 Wow, I didn't know this. But exactly how Python manage these strings? My
 interpretator gave me such results:
 
 >>> a = 'this'
 >>> b = 'this'
 >>> a is b
 True
 >>> a = 'this is confusing'
 >>> b = 'this is confusing'
 >>> a is b
 False


It's an implementation detail. You shouldn't use identity testing unless 
you actually care that two names refer to the same object, not because 
you want to save a few bytes. That's poor design: it's fragile, 
complicated, and defeats the purpose of using a high-level language like 
Python.




-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Steven D'Aprano
On Sat, 24 Nov 2007 04:54:43 -0800, samwyse wrote:

 create a hash that maps your keys to themselves, then use the values of
 that hash as your keys.
 
 store = {}
 def atom(str):
     global store
     if str not in store:
         store[str] = str
     return store[str]

Oh lordy, that's really made my day! That's the funniest piece of code 
I've seen for a long time! Worthy of being submitted to the DailyWTF.

Samwyse, while I applaud your willingness to help, I think you actually 
need to get some programming skills before doing so. Here's a hint to get 
you started: can you think of a way to optimize that function so it does 
less work?
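One optimization Steven may have in mind (my guess, not his words) collapses the test-and-set into a single dictionary operation with `dict.setdefault`:

```python
store = {}

def atom(s):
    # one lookup instead of up to three: store s if absent,
    # and return whichever equal string is already stored
    return store.setdefault(s, s)

a = ''.join(['spam', ' and eggs'])   # built at runtime: a fresh object
b = 'spam and eggs'
assert a == b
assert atom(a) is atom(b)            # both collapse to one canonical string
```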



-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Steven D'Aprano
On Sat, 24 Nov 2007 12:00:25 +0100, Bjoern Schliessmann wrote:

 Licheng Fang wrote:
 I mean, all the class instances that equal to each other should be
 reduced into only one instance, which means for instances of this class
 there's no difference between a is b and a==b.
 
 If you only want that if a == b is True also a is b is True,
 overload the is_ attribute of your class. Personally, I don't see any
 advantage in this.

No advantage? That's for sure. There is no is_ attribute of generic 
classes, and even if there was, it would have no special meaning.

Identity testing can't be overloaded. If it could, it would no longer be 
identity testing.


-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread George Sakkis
On Nov 24, 4:59 pm, Steven D'Aprano

[EMAIL PROTECTED] wrote:
 On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
  On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
  [EMAIL PROTECTED] wrote:
  Licheng Fang wrote:
   I find myself frequently in need of classes like this for two
   reasons. First, it's efficient in memory.

  Are you using millions of objects, or MB size objects? Otherwise, this
  is no argument.

  Yes, millions.

 Oh noes!!! Not millions of words That's like, oh, a few tens of
 megabytes1! How will a PC with one or two gigabytes of RAM cope?


Comments like these make one wonder if your real life experience with
massive data matches even the one tenth of your self-importance and
need to be snarky in most of your posts.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen recipes in
the Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.
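A minimal version of that idea applied to instances (a sketch in the spirit of the recipe; the class name and details are my own):

```python
class Interned(tuple):
    # maps constructor arguments to the one canonical instance
    _cache = {}

    def __new__(cls, *args):
        try:
            return cls._cache[args]
        except KeyError:
            instance = super().__new__(cls, args)
            cls._cache[args] = instance
            return instance

t1 = Interned(1, 2, 3)
t2 = Interned(1, 1 + 1, 3)
assert t1 is t2          # the identity test now works like equality
assert t1 == (1, 2, 3)   # still behaves as a tuple
```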

HTH,
George


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Bjoern Schliessmann
Steven D'Aprano wrote:

 No advantage? That's for sure. There is no is_ attribute of
 generic classes, and even if there was, it would have no special
 meaning.

Argl, I confused the operator module's attributes with objects ;)
 
Regards,


Björn

-- 
BOFH excuse #378:

Operators killed by year 2000 bug bite.



Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Steven D'Aprano
On Sat, 24 Nov 2007 14:58:50 -0800, George Sakkis wrote:

 On Nov 24, 4:59 pm, Steven D'Aprano
 
 [EMAIL PROTECTED] wrote:
 On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
  On Nov 24, 7:05 pm, Bjoern Schliessmann usenet-
  [EMAIL PROTECTED] wrote:
  Licheng Fang wrote:
   I find myself frequently in need of classes like this for two
   reasons. First, it's efficient in memory.

  Are you using millions of objects, or MB size objects? Otherwise,
  this is no argument.

  Yes, millions.

 Oh noes!!! Not millions of words That's like, oh, a few tens of
 megabytes1! How will a PC with one or two gigabytes of RAM
 cope?


 Comments like these make one wonder if your real life experience with
 massive data matches even the one tenth of your self-importance and need
 to be snarky in most of your posts.

I cheerfully admit to never needing to deal with massive data.

However, I have often needed to deal with tens and hundreds of megabytes 
of data, which IS NOT MASSIVE amounts of data to deal with on modern 
systems. Which was my point. 


 To the OP: yes, your use case is quite valid; the keyword you are
 looking for is memoize. You can find around a dozen of recipes in the
 Cookbook and posted in this list; here's one starting point:
 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.

This has nothing, absolutely NOTHING, to do with memoization. Memoization 
trades off memory for time, allowing slow functions to return results 
faster at the cost of using more memory. The OP wants to save memory, not 
use more of it.



-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Hrvoje Niksic
samwyse [EMAIL PROTECTED] writes:

 create a hash that maps your keys to themselves, then use the values
 of that hash as your keys.

The "atom" function you describe already exists under the name
"intern".


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Steven D'Aprano
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:

 samwyse [EMAIL PROTECTED] writes:
 
 create a hash that maps your keys to themselves, then use the values of
 that hash as your keys.
 
 The atom function you describe already exists under the name intern.

Not really. intern() works very differently, because it can tie itself to 
the Python internals. Samwyse's atom() function doesn't, and so has no 
purpose.


In any case, I'm not sure that intern() actually will solve the OP's 
problem, even assuming it is a real and not imaginary problem. According 
to the docs, intern()'s purpose is to speed up dictionary lookups, not to 
save memory. I suspect that if it does save memory, it will be by 
accident.

From the docs: 
http://docs.python.org/lib/non-essential-built-in-funcs.html

intern(string)
Enter string in the table of ``interned'' strings and return the interned 
string - which is string itself or a copy. Interning strings is useful to 
gain a little performance on dictionary lookup - if the keys in a 
dictionary are interned, and the lookup key is interned, the key 
comparisons (after hashing) can be done by a pointer compare instead of a 
string compare. Normally, the names used in Python programs are 
automatically interned, and the dictionaries used to hold module, class 
or instance attributes have interned keys. Changed in version 2.3: 
Interned strings are not immortal (like they used to be in Python 2.2 and 
before); you must keep a reference to the return value of intern() around 
to benefit from it.


Note the words "which is string itself or a copy". It would be ironic if 
the OP uses intern to avoid having copies of strings, and ends up with 
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see 
whether he actually has a memory problem or not, right?


-- 
Steven.


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread Hrvoje Niksic
Steven D'Aprano [EMAIL PROTECTED] writes:

 On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:

 samwyse [EMAIL PROTECTED] writes:
 
 create a hash that maps your keys to themselves, then use the values of
 that hash as your keys.
 
 The atom function you describe already exists under the name intern.

 Not really. intern() works very differently, because it can tie itself to 
 the Python internals.

The exact implementation mechanism is subtly different, but
functionally "intern" is equivalent to the "atom" function.

 In any case, I'm not sure that intern() actually will solve the OP's
 problem, even assuming it is a real and not imaginary
 problem. According to the docs, intern()'s purpose is to speed up
 dictionary lookups, not to save memory. I suspect that if it does
 save memory, it will be by accident.

It's not by accident, it follows from what interning does.  Interning
speeds up comparisons by returning the same string object for the same
string contents.  If the strings you're working with tend to repeat,
interning will save some memory simply by preventing storage of
multiple copies of the same string.  Whether the savings would make
any difference for the OP is another question.
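This is easy to see with interning itself (a sketch; note that in Python 3 the builtin moved to `sys.intern`):

```python
import sys

# two equal strings built independently at runtime,
# so neither is automatically interned
a = '-'.join(['natural', 'language'])
b = '-'.join(['natural', 'language'])
assert a == b

# interning maps both onto a single stored object,
# so only one copy of the contents need stay alive
ia = sys.intern(a)
ib = sys.intern(b)
assert ia is ib
```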

 From the docs: 
 http://docs.python.org/lib/non-essential-built-in-funcs.html

 intern(string)
 Enter string in the table of ``interned'' strings and return the interned 
 string - which is string itself or a copy. [...]

 Note the words which is string itself or a copy. It would be ironic if 
 the OP uses intern to avoid having copies of strings, and ends up with 
 even more copies than if he didn't bother.

That's a frequently misunderstood sentence.  It doesn't mean that
intern will make copies; it simply means that the string you get back
from intern can be either the string you passed it or another
(previously interned) string object that is guaranteed to have the
same contents as your string (which makes it technically a copy of
the string you passed to intern).


Re: How can I create customized classes that have similar properties as 'str'?

2007-11-24 Thread George Sakkis
On Nov 24, 7:42 pm, Steven D'Aprano

  To the OP: yes, your use case is quite valid; the keyword you are
  looking for is memoize. You can find around a dozen of recipes in the
  Cookbook and posted in this list; here's one starting point:
 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.

 This has nothing, absolutely NOTHING, to do with memoization. Memoization
 trades off memory for time, allowing slow functions to return results
 faster at the cost of using more memory. The OP wants to save memory, not
 use more of it.

If you bothered to click on that link you would learn that memoization
can be used to save space too and matches OP's case exactly; even the
identity tests work. Self-importance is bad enough by itself, even
without the ignorance, but you seem to do great in both.

George