Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 6:38 pm, Hrvoje Niksic [EMAIL PROTECTED] wrote: samwyse [EMAIL PROTECTED] writes: create a hash that maps your keys to themselves, then use the values of that hash as your keys. The atom function you describe already exists under the name intern. D'oh! That's what I get for not memorizing Non-essential Built-in Functions. In my defense, however, my function will work with anything that can be used as a dictionary key (strings, tuples, frozen sets, etc), not just character strings; thus we return to the original: a=(1,2,3) b=(1,1+1,1+1+1) a == b True a is b False atom(a) is atom(b) True intern(a) is intern(b) Traceback (most recent call last): File pyshell#33, line 1, in ? intern(a) is intern(b) TypeError: intern() argument 1 must be string, not tuple -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 4:44 am, Licheng Fang [EMAIL PROTECTED] wrote: Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams: I have some experience with this, helping my wife do computational linguistics. (I also have quite a lot of experience with similar things in my day job, which is a decentralized storage grid written in Python.) Unfortunately, Python is not a perfect tool for the job because, as you've learned, Python isn't overly concerned about conserving memory. Each object has substantial overhead associated with it (including each integer, each string, each tuple, ...), and dicts add overhead due to being sparsely filled. You should do measurements yourself to get results for your local CPU and OS, but I found, for example, that storing 20-byte keys and 8-byte values as a Python dict of Python strings took about 100 bytes per entry. Try tokenizing your trigrams by defining a dict from three unigrams to a sequentially allocated integer trigram id (also called a trigram token), and a reverse dict which goes from a trigram id to the three unigrams. Whenever you create a new set of three Python objects representing unigrams, you can pass them through the first mapping to get the trigram id and then free up the original three Python objects. If you do this multiple times, you get multiple references to the same integer object for the trigram id. My wife and I tried this, but it still wasn't compact enough to process her datasets in a mere 4 GiB of RAM. One tool that might help is PyJudy: http://www.dalkescientific.com/Python/PyJudy.html Judy is a delightfully memory-efficient, fast, and flexible data structure. In the specific example of trigram counting (which is also what my wife was doing), you can, for example, assign each to each unigram an integer, and assuming that you have less than two million unigrams you can pack three unigrams into a 64-bit integer... Hm, actually at this point my wife and I stopped using Python and rewrote it in C using JudyTrees. (At the time, PyJudy didn't exist.) If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn and perhaps she'll share the resulting C code with you. Regards, Zooko Wilcox-O'Hearn -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 27, 10:45 am, Steven D'Aprano [EMAIL PROTECTED] wrote: On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote: I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them, and I've observed they quickly ate up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, you can easily waste a lot of memory without taking extra care. I just thought if I define my pattern class with this 'atom' property, coding efforts could be easier later. I'm just not getting the same results as you when I try this. I'm finding that with no extra effort at all, it just works. The size of your corpus is not important. Neither is the complexity of how you generate the patterns. What's important is the number of patterns you produce, and millions isn't that huge a number, even without interning the strings. Obviously I'm not running your code, but I can build a dict with millions of patterns, from hundreds of megabytes of text, on a PC with just 1GB of memory and not run into any serious problems. I've just processed roughly 240MB of random emails, generating n-grams up to length 5. The emails include binary attachments and HTML etc., so I'm getting lots of patterns that don't normally exist in natural languages (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has only 1GB, and that's being shared with about a dozen other apps (including such memory hogs as Firefox). Results? 64939962 patterns found, of which 17031467 are unique. There's paging, yes, and my PC runs a little slow when I try to switch from one running application to another, but nothing unusable. Opening a dozen YouTube videos at once impacts performance worse. I can't think what you're doing to use up 8GB of RAM for merely millions of strings, even if you are keeping two, three, ten redundant copies. Assuming an average length of twenty bytes per pattern (you said trigrams, but I'd rather over-estimate than under), and even assuming that only half the 8GB are available to Python, you should be able to store something of the order of one hundred million patterns easily: My task is identifying sequences of tokens (phrases) that are possible tranlations of each other from a bilingual corpus. I need to check all the subsequences of a sentence to get the possible phrase pairs. This makes the problem different from n-gram counting in that the number of possible phrases doesn't grow linearly with n, but approximately with n^2. (n being the sentence length.) My first version of the program consumes almost twice as much memory as the current one because I discovered in doing different statistics I was regenerating the patterns, and the counting dictionaries ended up with duplicated pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I can restrict the pattern class to not generate identical instances? So I can avoid such subtle but significant bugs. 4 bytes for a pointer plus 20 bytes for the string = 24 bytes 4*1024**3 / 24 = 178,956,970 (This is a ballpark figure. Real Python strings will have additional overhead.) If you're running into problems with merely millions of patterns, then you're doing worse by probably two orders of magnitude. I don't think that the problem lies where you think it does. If you have a dict with millions of keys, it doesn't matter how many times each pattern exists in the corpus because the key only exists *once* in the dict. Duplicate the dict, or generate it from scratch even, and at worse you double the memory used by the keys. And probably not even that. The only thing I can think of that might explain why you're using so much memory is if you are generating *all* the patterns up front, say in a list, before adding them to the dict: # Generate one massive list of patterns containing many duplicates patterns = make_patterns(text) # returns a massive list like ['fre', 'req', 'equ', 'que' ...] d = {} for pattern in patterns: d[pattern] = d.get(pattern, 0) + 1 No, I wasn't doing that. BTW, do you think the pattern counting code can avoid hashing the pattern twice? Is there a way to do that when the dictionary values are of a primitive type? Notice that the real killer in the above algorithm is that you need enough storage, not just for the unique patterns, but for EVERY separate occurrence of each pattern. Nothing to do with how dicts operate, and everything to do with the algorithm you (hypothetically) are using. If that's what you're doing, then no wonder you're running out of memory. With 200MB of text, you have 209715198 trigrams in your list. The pointers alone will take almost a gigabyte, assuming 32-bit pointers. If this is your approach, interning the strings won't save you. You almost certainly should change to a lazy approach, and use a
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 27, 2007 7:16 AM, Licheng Fang [EMAIL PROTECTED] wrote: On Nov 27, 10:45 am, Steven D'Aprano [EMAIL PROTECTED] wrote: On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote: I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them, and I've observed they quickly ate up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, you can easily waste a lot of memory without taking extra care. I just thought if I define my pattern class with this 'atom' property, coding efforts could be easier later. I'm just not getting the same results as you when I try this. I'm finding that with no extra effort at all, it just works. The size of your corpus is not important. Neither is the complexity of how you generate the patterns. What's important is the number of patterns you produce, and millions isn't that huge a number, even without interning the strings. Obviously I'm not running your code, but I can build a dict with millions of patterns, from hundreds of megabytes of text, on a PC with just 1GB of memory and not run into any serious problems. I've just processed roughly 240MB of random emails, generating n-grams up to length 5. The emails include binary attachments and HTML etc., so I'm getting lots of patterns that don't normally exist in natural languages (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has only 1GB, and that's being shared with about a dozen other apps (including such memory hogs as Firefox). Results? 64939962 patterns found, of which 17031467 are unique. There's paging, yes, and my PC runs a little slow when I try to switch from one running application to another, but nothing unusable. Opening a dozen YouTube videos at once impacts performance worse. I can't think what you're doing to use up 8GB of RAM for merely millions of strings, even if you are keeping two, three, ten redundant copies. Assuming an average length of twenty bytes per pattern (you said trigrams, but I'd rather over-estimate than under), and even assuming that only half the 8GB are available to Python, you should be able to store something of the order of one hundred million patterns easily: My task is identifying sequences of tokens (phrases) that are possible tranlations of each other from a bilingual corpus. I need to check all the subsequences of a sentence to get the possible phrase pairs. This makes the problem different from n-gram counting in that the number of possible phrases doesn't grow linearly with n, but approximately with n^2. (n being the sentence length.) My first version of the program consumes almost twice as much memory as the current one because I discovered in doing different statistics I was regenerating the patterns, and the counting dictionaries ended up with duplicated pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I can restrict the pattern class to not generate identical instances? So I can avoid such subtle but significant bugs. Implement __hash__ and __eq__ on your pattern class. If the same pattern compares equal and hashes the same then it will be a matching key as far as the dict is concerned and will only be stored once. This is probably cheaper than explicit interning anyway (you don't need to search an intern table). The only thing I can think of that might explain why you're using so much memory is if you are generating *all* the patterns up front, say in a list, before adding them to the dict: # Generate one massive list of patterns containing many duplicates patterns = make_patterns(text) # returns a massive list like ['fre', 'req', 'equ', 'que' ...] d = {} for pattern in patterns: d[pattern] = d.get(pattern, 0) + 1 No, I wasn't doing that. BTW, do you think the pattern counting code can avoid hashing the pattern twice? Is there a way to do that when the dictionary values are of a primitive type? Hashing isn't really an expensive operation. On strings it's even cached on the object. If you implement your own __hash__ method you can do the same, but I wouldn't bother unless you benchmark it and discover that hashing is a bottleneck. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote: I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them, and I've observed they quickly ate up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, you can easily waste a lot of memory without taking extra care. I just thought if I define my pattern class with this 'atom' property, coding efforts could be easier later. I'm just not getting the same results as you when I try this. I'm finding that with no extra effort at all, it just works. The size of your corpus is not important. Neither is the complexity of how you generate the patterns. What's important is the number of patterns you produce, and millions isn't that huge a number, even without interning the strings. Obviously I'm not running your code, but I can build a dict with millions of patterns, from hundreds of megabytes of text, on a PC with just 1GB of memory and not run into any serious problems. I've just processed roughly 240MB of random emails, generating n-grams up to length 5. The emails include binary attachments and HTML etc., so I'm getting lots of patterns that don't normally exist in natural languages (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC has only 1GB, and that's being shared with about a dozen other apps (including such memory hogs as Firefox). Results? 64939962 patterns found, of which 17031467 are unique. There's paging, yes, and my PC runs a little slow when I try to switch from one running application to another, but nothing unusable. Opening a dozen YouTube videos at once impacts performance worse. I can't think what you're doing to use up 8GB of RAM for merely millions of strings, even if you are keeping two, three, ten redundant copies. Assuming an average length of twenty bytes per pattern (you said trigrams, but I'd rather over-estimate than under), and even assuming that only half the 8GB are available to Python, you should be able to store something of the order of one hundred million patterns easily: 4 bytes for a pointer plus 20 bytes for the string = 24 bytes 4*1024**3 / 24 = 178,956,970 (This is a ballpark figure. Real Python strings will have additional overhead.) If you're running into problems with merely millions of patterns, then you're doing worse by probably two orders of magnitude. I don't think that the problem lies where you think it does. If you have a dict with millions of keys, it doesn't matter how many times each pattern exists in the corpus because the key only exists *once* in the dict. Duplicate the dict, or generate it from scratch even, and at worse you double the memory used by the keys. And probably not even that. The only thing I can think of that might explain why you're using so much memory is if you are generating *all* the patterns up front, say in a list, before adding them to the dict: # Generate one massive list of patterns containing many duplicates patterns = make_patterns(text) # returns a massive list like ['fre', 'req', 'equ', 'que' ...] d = {} for pattern in patterns: d[pattern] = d.get(pattern, 0) + 1 Notice that the real killer in the above algorithm is that you need enough storage, not just for the unique patterns, but for EVERY separate occurrence of each pattern. Nothing to do with how dicts operate, and everything to do with the algorithm you (hypothetically) are using. If that's what you're doing, then no wonder you're running out of memory. With 200MB of text, you have 209715198 trigrams in your list. The pointers alone will take almost a gigabyte, assuming 32-bit pointers. If this is your approach, interning the strings won't save you. You almost certainly should change to a lazy approach, and use a generator to make each pattern as needed, then thrown away: def make_patterns(s, n=3): Yield n-grams. if len(s) = n: yield s else: for i in range(len(s)-n+1): yield s[i:i+n] d = {} fp = open('corpus', 'r') for line in fp: for word in line.split(): for pattern in make_patterns(word.strip()): d[pattern] = d.get(pattern, 0) + 1 Obviously I haven't seen your code and don't actually know what you are doing. If my 1GB machine can deal with a dict of 17 million unique keys from a corpus of 240MB with no special memory management, your 8GB machine should too. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Steven D'Aprano wrote: store = {} def atom(str): global store if str not in store: store[str] = str return store[str] Oh lordy, that's really made my day! That's the funniest piece of code I've seen for a long time! Worthy of being submitted to the DailyWTF. Here's a script to show atom()'s effect on memory footprint: $ cat atom.py import sys data = [1]*1000 items = [] cache = {} if -a in sys.argv: def atom(obj): try: return cache[obj] except KeyError: cache[obj] = obj return obj else: def atom(obj): return obj try: while 1: items.append(atom(tuple(data))) except MemoryError: print len(items) $ ulimit -v 5000 $ python atom.py 226 $ python atom.py -a 185742 So if you are going to submit Sam's function make sure to bundle it with this little demo... Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 25, 5:59 am, Steven D'Aprano [EMAIL PROTECTED] cybersource.com.au wrote: On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- [EMAIL PROTECTED] wrote: Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. Yes, millions. Oh noes!!! Not millions of words That's like, oh, a few tens of megabytes1! How will a PC with one or two gigabytes of RAM cope? Tens of megabytes is not a lot of data. If the average word size is ten characters, then one million words takes ten million bytes, or a little shy of ten megabytes. Even if you are using four-byte characters, you've got 40 MB, still a moderate amount of data on a modern system. I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them, and I've observed they quickly ate up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, you can easily waste a lot of memory without taking extra care. I just thought if I define my pattern class with this 'atom' property, coding efforts could be easier later. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams: trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)] I can save the counts in a dict D1. Later, I may want to recount the trigrams, with some minor modifications, say, doing it on every other line of the input file, and the counts are saved in dict D2. Problem is, D1 and D2 have almost the same set of keys (trigrams of the text), yet the keys in D2 are new instances, even though these keys probably have already been inserted into D1. So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data. All these keys will almost certainly add up to only a few hundred megabytes, which is a reasonable size of data but not excessive. This really sounds to me like a case of premature optimization. I think you are wasting your time solving a non-problem. [snip] Wow, I didn't know this. But exactly how Python manage these strings? My interpretator gave me such results: a = 'this' b = 'this' a is b True a = 'this is confusing' b = 'this is confusing' a is b False It's an implementation detail. You shouldn't use identity testing unless you actually care that two names refer to the same object, not because you want to save a few bytes. That's poor design: it's fragile, complicated, and defeats the purpose of using a high-level language like Python. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sun, 25 Nov 2007 10:39:38 +0100, Peter Otten wrote: So if you are going to submit Sam's function make sure to bundle it with this little demo... Well Peter, I was going to reply with a comment about not changing the problem domain (tuples of ints to trigrams from a text file for natural language processing, that is, three character alphanumeric strings), and that if you re-did your test with strings (as I did) you would see absolutely no difference. What I was going to say was Tuples aren't interned. Short strings that look like identifiers are. Jumping through hoops to cache things which are already cached is not productive programming. But then I dug a little deeper, and disassembled the code I was running, and discovered that I was being fooled by the Python compiler's constant- folding, and if I took steps to defeat the optimizer, the effect I was seeing disappeared, and I got the same results as you. Well. So I've learned something new: Python doesn't intern strings in the way I thought it did. I don't quite know *how* it decides which strings to intern and which ones not to, but at least I've learnt that what I thought was true is not true. So I offer my apology to Samwyse, the caching code isn't as redundant and silly as it appears, and humbly tuck into this nice steaming plate of crow. Somebody pass the salt please. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 6:42 pm, Steven D'Aprano [EMAIL PROTECTED] cybersource.com.au wrote: This has nothing, absolutely NOTHING, to do with memoization. Memoization trades off memory for time, allowing slow functions to return results faster at the cost of using more memory. The OP wants to save memory, not use more of it. Not to beat a dead horse, but memoization can significantly minimize memory usage, given a large data set with redundant elements (as the OP seems to suggest [e.g., calculating the deltas of trigrams in a natural language]). For example, if the data set has 1/3 redundant elements, then the un-memoized version requires 1/3 more memory, because it needs to store 1/3 more unique copies of the element, whereas the memoized version has only to store unique elements once. Every instance of an element which is already in the cache requires only the cache lookup (speed), rather than the creation of a new object (memory). So the trade-off is actually speed for memory rather than the other way around. Of course, it all depends on the nature of the data; but for large, redundant data sets, memoization is definitely a win when it comes to memory optimization. Regards, Jordan -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Steven D'Aprano wrote: On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote: samwyse [EMAIL PROTECTED] writes: create a hash that maps your keys to themselves, then use the values of that hash as your keys. The atom function you describe already exists under the name intern. Not really. intern() works very differently, because it can tie itself to the Python internals. Samwyse's atom() function doesn't, and so has no purpose. In any case, I'm not sure that intern() actually will solve the OP's problem, even assuming it is a real and not imaginary problem. According to the docs, intern()'s purpose is to speed up dictionary lookups, not to save memory. I suspect that if it does save memory, it will be by accident. From the docs: http://docs.python.org/lib/non-essential-built-in-funcs.html Yes, but it seems that buffer remains with Python 3000. Colin W. intern( string) Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup - if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it. Note the words which is string itself or a copy. It would be ironic if the OP uses intern to avoid having copies of strings, and ends up with even more copies than if he didn't bother. I guess he'll actually need to measure his memory consumption and see whether he actually has a memory problem or not, right? -- http://mail.python.org/mailman/listinfo/python-list
How can I create customized classes that have similar properties as 'str'?
I mean, all the class instances that equal to each other should be reduced into only one instance, which means for instances of this class there's no difference between a is b and a==b. Thank you. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Second, when two instances are compared for equality only their pointers are compared. (I think that's how Python compares 'str's. On Nov 24, 6:31 pm, Licheng Fang [EMAIL PROTECTED] wrote: I mean, all the class instances that equal to each other should be reduced into only one instance, which means for instances of this class there's no difference between a is b and a==b. Thank you. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Licheng Fang wrote: I mean, all the class instances that equal to each other should be reduced into only one instance, which means for instances of this class there's no difference between a is b and a==b. If you only want that if a == b is True also a is b is True, overload the is_ attribute of your class. Personally, I don't see any advantage in this. Regards, Björn -- BOFH excuse #352: The cables are not the same length. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. BTW, what happens if you, by some operation, make a == b, and afterwards change b so another object instance must be created? This instance management is quite a runtime overhead. Second, when two instances are compared for equality only their pointers are compared. I state that the object management will often eat more performance than equality testing. Except you have a huge number of equal objects. If the latter was the case you should rethink your program design. (I think that's how Python compares 'str's. Generally not. In CPython, just very short strings are created only once. a= b= a is b True a= b= a is b False Regards, Björn -- BOFH excuse #430: Mouse has out-of-cheese-error -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- [EMAIL PROTECTED] wrote: Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams: trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)] I can save the counts in a dict D1. Later, I may want to recount the trigrams, with some minor modifications, say, doing it on every other line of the input file, and the counts are saved in dict D2. Problem is, D1 and D2 have almost the same set of keys (trigrams of the text), yet the keys in D2 are new instances, even though these keys probably have already been inserted into D1. So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data. BTW, what happens if you, by some operation, make a == b, and afterwards change b so another object instance must be created? This instance management is quite a runtime overhead. I probably need this class to be immutable. Second, when two instances are compared for equality only their pointers are compared. I state that the object management will often eat more performance than equality testing. Except you have a huge number of equal objects. If the latter was the case you should rethink your program design. Yeah, counting is all about equal or not. (I think that's how Python compares 'str's. Generally not. In CPython, just very short strings are created only once. a= b= a is b True a= b= a is b Wow, I didn't know this. But exactly how Python manage these strings? My interpretator gave me such results: a = 'this' b = 'this' a is b True a = 'this is confusing' b = 'this is confusing' a is b False False Regards, Björn -- BOFH excuse #430: Mouse has out-of-cheese-error -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- BTW, what happens if you, by some operation, make a == b, and afterwards change b so another object instance must be created? This instance management is quite a runtime overhead. I probably need this class to be immutable. IMHO you don't need a change of Python, but simply a special implementation (probably using metaclasses/singletons). Wow, I didn't know this. But exactly how Python manage these strings? I don't know (use the source, Luke). :) Or perhaps there is a Python Elder here that knows? Regards, Björn -- BOFH excuse #165: Backbone Scoliosis -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 5:44 am, Licheng Fang [EMAIL PROTECTED] wrote: Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. [...] So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data. create a hash that maps your keys to themselves, then use the values of that hash as your keys. store = {} def atom(str): global store if str not in store: store[str] = str return store[str] a='this is confusing' b='this is confusing' a == b True a is b False atom(a) is atom(b) True -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote: Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- Wow, I didn't know this. But exactly how Python manage these strings? I don't know (use the source, Luke). :) Or perhaps there is a Python Elder here that knows? AFAIK strings of length 1 and strings that would be valid Python identifiers are treated this way. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 9:42 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote: On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote: Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- Wow, I didn't know this. But exactly how Python manage these strings? I don't know (use the source, Luke). :) Or perhaps there is a Python Elder here that knows? AFAIK strings of length 1 and strings that would be valid Python identifiers are treated this way. Ciao, Marc 'BlackJack' Rintsch Thanks. Then, is there a way to make python treat all strings this way, or any other kind of immutable objects? -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Licheng Fang a écrit : I mean, all the class instances that equal to each other should be reduced into only one instance, which means for instances of this class there's no difference between a is b and a==b. Here's a QD attempt - without any garantee, and to be taylored to your needs. _values = {}#id(instance) = value mapping _instances = {} #hash(value) = instance mapping class Value(object): def __new__(cls, value): try: return _instances[hash(value)] except KeyError: instance = object.__new__(cls) _values[id(instance)] = value _instances[hash(value)] = instance return instance @apply def value(): def fget(self): return _values[id(self)] def fset(self, ignore): raise AttributeError(%s.value is read only % type(self)) def fdel(self): raise AttributeError(%s.value is read only % type(self)) return property(**locals()) HTH -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 10:35 am, Licheng Fang [EMAIL PROTECTED] wrote: Thanks. Then, is there a way to make python treat all strings this way, or any other kind of immutable objects? The word generally used is 'atom' when referring to strings that are set up such that 'a == b' implies 'a is b'. This is usually an expensive process, since you don't want to do it to strings that are, e.g., read from a file. Yes, it could be done only for string constants, and some languages (starting with LISP) do this, but that isn't what you (or most people) want. Whether you realize it or not, you want control over the process; in your example, you don't want to do it for the lines read from your file, just the trigrams. The example that I gave does exactly that. It adds a fixed amount of storage for each string that you 'intern' (the usual name given to the process of generating such a string. Let's look at my code again: store = {} def atom(str): global store if str not in store: store[str] = str return store[str] Each string passed to 'atom' already exists. We look to see if copy already exists; if so we can discard the latest instance and use that copy henceforth. If a copy does not exist, we save the string inside 'store'. Since the string already exists, we're just increasing its reference count by two (so it won't be reference counted) and increasing the size of 'store' by (an amortized) pair of pointers to that same string. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 5:44 am, Licheng Fang [EMAIL PROTECTED] wrote: Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams: trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)] BTW, if the components of your trigrams are never larger than a byte, then encode the tuples as integers and don't worry about pointer comparisons. def encode(s): return (ord(s[0])*256+ord(s[1]))*256+ord(s[2]) def trigram(s): return [ encode(s[i:i+3]) for i in range(0, len(s)-2)] trigram('abcde') [6382179, 6447972, 6513765] -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- [EMAIL PROTECTED] wrote: Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. Yes, millions. Oh noes!!! Not millions of words That's like, oh, a few tens of megabytes1! How will a PC with one or two gigabytes of RAM cope? Tens of megabytes is not a lot of data. If the average word size is ten characters, then one million words takes ten million bytes, or a little shy of ten megabytes. Even if you are using four-byte characters, you've got 40 MB, still a moderate amount of data on a modern system. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in a huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams: trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)] I can save the counts in a dict D1. Later, I may want to recount the trigrams, with some minor modifications, say, doing it on every other line of the input file, and the counts are saved in dict D2. Problem is, D1 and D2 have almost the same set of keys (trigrams of the text), yet the keys in D2 are new instances, even though these keys probably have already been inserted into D1. So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data. All these keys will almost certainly add up to only a few hundred megabytes, which is a reasonable size of data but not excessive. This really sounds to me like a case of premature optimization. I think you are wasting your time solving a non-problem. [snip] Wow, I didn't know this. But exactly how Python manage these strings? My interpretator gave me such results: a = 'this' b = 'this' a is b True a = 'this is confusing' b = 'this is confusing' a is b False It's an implementation detail. You shouldn't use identity testing unless you actually care that two names refer to the same object, not because you want to save a few bytes. That's poor design: it's fragile, complicated, and defeats the purpose of using a high-level language like Python. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sat, 24 Nov 2007 04:54:43 -0800, samwyse wrote: create a hash that maps your keys to themselves, then use the values of that hash as your keys. store = {} def atom(str): global store if str not in store: store[str] = str return store[str] Oh lordy, that's really made my day! That's the funniest piece of code I've seen for a long time! Worthy of being submitted to the DailyWTF. Samwyse, while I applaud your willingness to help, I think you actually need to get some programming skills before doing so. Here's a hint to get you started: can you think of a way to optimize that function so it does less work? -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sat, 24 Nov 2007 12:00:25 +0100, Bjoern Schliessmann wrote: Licheng Fang wrote: I mean, all the class instances that equal to each other should be reduced into only one instance, which means for instances of this class there's no difference between a is b and a==b. If you only want that if a == b is True also a is b is True, overload the is_ attribute of your class. Personally, I don't see any advantage in this. No advantage? That's for sure. There is no is_ attribute of generic classes, and even if there was, it would have no special meaning. Identity testing can't be overloaded. If it could, it would no longer be identity testing. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 4:59 pm, Steven D'Aprano [EMAIL PROTECTED] wrote: On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- [EMAIL PROTECTED] wrote: Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. Yes, millions. Oh noes!!! Not millions of words That's like, oh, a few tens of megabytes1! How will a PC with one or two gigabytes of RAM cope? Comments like these make one wonder if your real life experience with massive data matches even the one tenth of your self-importance and need to be snarky in most of your posts. To the OP: yes, your use case is quite valid; the keyword you are looking for is memoize. You can find around a dozen of recipes in the Cookbook and posted in this list; here's one starting point: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717. HTH, George -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Steven D'Aprano wrote: No advantage? That's for sure. There is no is_ attribute of generic classes, and even if there was, it would have no special meaning. Argl, I confused the operator module's attributes with objects ;) Regards, Björn -- BOFH excuse #378: Operators killed by year 2000 bug bite. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sat, 24 Nov 2007 14:58:50 -0800, George Sakkis wrote: On Nov 24, 4:59 pm, Steven D'Aprano [EMAIL PROTECTED] wrote: On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote: On Nov 24, 7:05 pm, Bjoern Schliessmann usenet- [EMAIL PROTECTED] wrote: Licheng Fang wrote: I find myself frequently in need of classes like this for two reasons. First, it's efficient in memory. Are you using millions of objects, or MB size objects? Otherwise, this is no argument. Yes, millions. Oh noes!!! Not millions of words That's like, oh, a few tens of megabytes1! How will a PC with one or two gigabytes of RAM cope? Comments like these make one wonder if your real life experience with massive data matches even the one tenth of your self-importance and need to be snarky in most of your posts. I cheerfully admit to never needing to deal with massive data. However, I have often needed to deal with tens and hundreds of megabytes of data, which IS NOT MASSIVE amounts of data to deal with on modern systems. Which was my point. To the OP: yes, your use case is quite valid; the keyword you are looking for is memoize. You can find around a dozen of recipes in the Cookbook and posted in this list; here's one starting point: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717. This has nothing, absolutely NOTHING, to do with memoization. Memoization trades off memory for time, allowing slow functions to return results faster at the cost of using more memory. The OP wants to save memory, not use more of it. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
samwyse [EMAIL PROTECTED] writes: create a hash that maps your keys to themselves, then use the values of that hash as your keys. The atom function you describe already exists under the name intern. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote: samwyse [EMAIL PROTECTED] writes: create a hash that maps your keys to themselves, then use the values of that hash as your keys. The atom function you describe already exists under the name intern. Not really. intern() works very differently, because it can tie itself to the Python internals. Samwyse's atom() function doesn't, and so has no purpose. In any case, I'm not sure that intern() actually will solve the OP's problem, even assuming it is a real and not imaginary problem. According to the docs, intern()'s purpose is to speed up dictionary lookups, not to save memory. I suspect that if it does save memory, it will be by accident. From the docs: http://docs.python.org/lib/non-essential-built-in-funcs.html intern( string) Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup - if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it. Note the words which is string itself or a copy. It would be ironic if the OP uses intern to avoid having copies of strings, and ends up with even more copies than if he didn't bother. I guess he'll actually need to measure his memory consumption and see whether he actually has a memory problem or not, right? -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
Steven D'Aprano [EMAIL PROTECTED] writes: On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote: samwyse [EMAIL PROTECTED] writes: create a hash that maps your keys to themselves, then use the values of that hash as your keys. The atom function you describe already exists under the name intern. Not really. intern() works very differently, because it can tie itself to the Python internals. The exact implementation mechanism is subtly different, but functionally intern is equivalent to the atom function. In any case, I'm not sure that intern() actually will solve the OP's problem, even assuming it is a real and not imaginary problem. According to the docs, intern()'s purpose is to speed up dictionary lookups, not to save memory. I suspect that if it does save memory, it will be by accident. It's not by accident, it follows from what interning does. Interning speeds up comparisons by returning the same string object for the same string contents. If the strings you're working with tend to repeat, interning will save some memory simply by preventing storage of multiple copies of the same string. Whether the savings would make any difference for the OP is another question. From the docs: http://docs.python.org/lib/non-essential-built-in-funcs.html intern( string) Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. [...] Note the words which is string itself or a copy. It would be ironic if the OP uses intern to avoid having copies of strings, and ends up with even more copies than if he didn't bother. That's a frequently misunderstood sentence. It doesn't mean that intern will make copies; it simply means that the string you get back from intern can be either the string you passed it or another (previously interned) string object that is guaranteed to have the same contents as your string (which makes it technically a copy of the string you passed to intern). -- http://mail.python.org/mailman/listinfo/python-list
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 7:42 pm, Steven D'Aprano To the OP: yes, your use case is quite valid; the keyword you are looking for is memoize. You can find around a dozen of recipes in the Cookbook and posted in this list; here's one starting point: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717. This has nothing, absolutely NOTHING, to do with memoization. Memoization trades off memory for time, allowing slow functions to return results faster at the cost of using more memory. The OP wants to save memory, not use more of it. If you bothered to click on that link you would learn that memoization can be used to save space too and matches OP's case exactly; even the identity tests work. Self-importance is bad enough by itself, even without the ignorance, but you seem to do great in both. George -- http://mail.python.org/mailman/listinfo/python-list