[issue4356] Add "key" argument to "bisect" module functions
Dmitry Chichkov added the comment:

Use case: a custom immutable array with a large number of items and indirect key-field access, for example a ctypes array, a memoryview, a ctypes pointer, or any other custom container.

1. I'm not sure how anyone can consider a precached key array a right and scalable answer. Precaching the key array is an O(N) operation, while bisect is O(log N).

2. @Raymond: there is a statement that "adding 'key()' would encourage bad design and steer people away from better solutions." Well, right now the design that is being encouraged results in O(N) code, because lazy developers are simply pre-caching (copying) the keys!

3. There is a statement that key() would have to be called multiple times per item. What? Why?

4. There is a statement that one can always add a __cmp__ method to an object. That does not help here: the object could be immutable, and there should be no need to modify the object or the array when all you need to do is bisect it. (A pure-Python sketch of the requested key-aware bisect follows this message.)

--
nosy: +dmtr

___ Python tracker <http://bugs.python.org/issue4356> ___
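For illustration, a minimal pure-Python sketch of the kind of key-aware bisect being requested; the name bisect_left_key and its exact signature are made up for this example and are not an existing API:

def bisect_left_key(a, x, lo=0, hi=None, key=lambda v: v):
    """Locate the insertion point for x in a, comparing key(a[mid]) < x.

    key() is called O(log N) times, once per probe, and the container `a`
    only needs __len__ and __getitem__, so this works for immutable
    containers such as ctypes arrays or memoryviews.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(a[mid]) < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Usage: records sorted by their first field; x is the already-extracted key.
records = [(10, 'a'), (20, 'b'), (30, 'c')]
print(bisect_left_key(records, 25, key=lambda r: r[0]))   # -> 2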
[issue9520] Add Patricia Trie high performance container
Dmitry Chichkov added the comment:

Yes, it looks like you are right. And while there is some slight performance degradation, at least nothing drastic is happening up to 30M keys. Using your modified test:

    1000 words (     961 keys), 3609555 words/s, 19239926 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 3390980 words/s, 18180771 lookups/s, 87 bytes/key (0.0MB)
  100000 words (   83168 keys), 3298809 words/s, 12509481 lookups/s, 37 bytes/key (3.0MB)
 1000000 words (  755372 keys), 2477793 words/s,  7537963 lookups/s, 66 bytes/key (48.0MB)
 5000000 words ( 3501140 keys), 2291004 words/s,  6487727 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 2238081 words/s,  6216454 lookups/s, 59 bytes/key (384.0MB)
20000000 words (13061821 keys), 2175688 words/s,  5817085 lookups/s, 61 bytes/key (768.0MB)
50000000 words (31188460 keys), 2117751 words/s,  5137209 lookups/s, 51 bytes/key (1536.0MB)

It looks like the internal memory estimates (sys.getsizeof()) are also pretty good.

Before:
['VmPeak:\t10199764 kB', 'VmSize:\t 5552416 kB']
50000000 words (31188460 keys), 2117751 words/s, 5137209 lookups/s, 51 bytes/key (1536.0MB)

After:
['VmPeak:\t10199764 kB', 'VmSize:\t 7125284 kB']

7 125 284 kB - 5 552 416 kB = 1 536.00391 MB

--

___ Python tracker <http://bugs.python.org/issue9520> ___
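A rough sketch of the kind of cross-check described above: compare a container's own sys.getsizeof() estimate against the growth of the process's VmSize. The dict below is only a stand-in for the container being measured (and, unlike the trie, sys.getsizeof() on a dict excludes the keys and values, so the two numbers are expected to diverge); reading /proc is Linux-specific.

import os, re, sys

def vmsize_kb():
    # Current VmSize of this process in kB (Linux-only, parses /proc).
    with open('/proc/%d/status' % os.getpid()) as f:
        return int(re.search(r'VmSize:\s*(\d+) kB', f.read()).group(1))

before = vmsize_kb()
container = {}                      # stand-in for the container under test
for i in range(1000000):
    container[str(i)] = i
after = vmsize_kb()

reported = sys.getsizeof(container) / (1024.0 * 1024.0)   # container's own estimate, MB
observed = (after - before) / 1024.0                      # actual VmSize growth, MB
print('reported %.1f MB, observed %.1f MB' % (reported, observed))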
[issue9520] Add Patricia Trie high performance container
Changes by Dmitry Chichkov:

Added file: http://bugs.python.org/file18515/dc.dict.bench.0.02.py

___ Python tracker <http://bugs.python.org/issue9520> ___
[issue9520] Add Patricia Trie high performance container
Dmitry Chichkov added the comment:

Yes. Data containers optimized for very large datasets, compactness and strict adherence to O(1) can be beneficial. Python has great high-performance containers, but there is a certain lack of compact ones. For example, on an x64 machine the following dict() mapping 10,000,000 very short unicode keys (~7 chars) to integers eats 149 bytes per entry:

>>> import os, re
>>> d = dict()
>>> for i in xrange(0, 10000000): d[unicode(i)] = i
>>> print re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
['VmPeak:\t 1458324 kB', 'VmSize:\t 1458324 kB']

I can understand that there are all kinds of reasons why it is so, and even why it is good. But having an unobtrusive *compact* container could be nice (although you would be most welcome to tweak the default containers so that they adjust to large datasets appropriately). I also can't emphasize enough that compactness still matters sometimes: modern datasets are getting larger and larger (literally terabytes), and the 'just add more memory' strategy is not always feasible.

Regarding the dict() violation of O(1): so far I have been unable to reproduce it in the test. I can certainly see it on the real dataset, and trust me, it was very annoying to watch the ETA go down from 10 hours to 8 hours and then gradually climb back up to 17 hours and hang there. This was _solved_ by switching from dict() to Bio.trie(), so the problem certainly had something to do with dict(). I don't know what is causing it, though.

--

___ Python tracker <http://bugs.python.org/issue9520> ___
[issue9520] Add Patricia Trie high performance container
Dmitry Chichkov added the comment:

No, I'm not simply running out of system memory (8 GB/x64/Linux); in my test cases I've only seen ~25% of memory utilized. And good idea, I'll try to play with the cyclic garbage collector.

It is harder than I thought to make a solid synthetic test case for this issue. The trouble is that you need to be able to generate data (e.g. 100,000,000 words, 5,000,000 unique) with a distribution close to the real-life scenario (word lengths, frequencies and uniqueness as in English text). If somebody has a good idea of how to do this nicely, it would be very welcome; my best shot so far is in the attachment. A rough sketch of one possible generator is shown after this message.

--
Added file: http://bugs.python.org/file18406/dc.dict.bench.0.01.py

___ Python tracker <http://bugs.python.org/issue9520> ___
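For what it's worth, a rough sketch of one way to generate such data (this is not the attached benchmark): draw word ranks from a Zipf-like (Pareto) distribution and map each rank to a deterministic pseudo-word, so that frequencies and the unique/total ratio roughly mimic natural text. The vocabulary size and exponent below are arbitrary assumptions.

import random
import string

def zipf_words(n_words, vocab_size=5000000, alpha=1.1, seed=42):
    """Yield n_words pseudo-words whose ranks follow a Zipf-like law.

    A rank is drawn via a Pareto variate and mapped to a deterministic
    pseudo-word, so frequent ranks repeat (like common English words) and
    rare ranks contribute the long tail of unique keys.
    """
    rng = random.Random(seed)
    letters = string.ascii_lowercase
    for _ in range(n_words):
        # paretovariate(alpha) >= 1.0; clamp the rank to the vocabulary size
        rank = min(int(rng.paretovariate(alpha)), vocab_size)
        # turn the rank into a short pseudo-word (base-26 digits)
        r, chars = rank, []
        while True:
            r, d = divmod(r, 26)
            chars.append(letters[d])
            if r == 0:
                break
        yield ''.join(chars)

# Usage: check total vs unique counts on a small sample.
sample = list(zipf_words(100000))
print('%d words, %d unique' % (len(sample), len(set(sample))))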
[issue9520] Add Patricia Trie high performance container
Dmitry Chichkov added the comment:

Thank you for your comment. Perhaps we should try to separate this into two issues:

1) Bug: Python's dict() is unusable on datasets with 10,000,000+ keys. Here I should provide a solid test case showing a deviation from O(1). (A sketch of such a timing test is shown below.)
2) Feature request/idea: add a Patricia trie high-performance container.

--

___ Python tracker <http://bugs.python.org/issue9520> ___
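A sketch of the kind of test case point 1 asks for (not an existing script from the tracker): time dict insertions in fixed-size batches and check whether the per-batch cost stays roughly flat, as amortized O(1) would predict, or grows with the number of keys. The batch size and key format are arbitrary choices.

import time

def insertion_profile(total_keys=10000000, batch=1000000):
    """Print the time taken by each batch of dict insertions.

    If insertion is amortized O(1), the per-batch times should stay roughly
    constant (apart from occasional resize spikes) as the table grows.
    """
    d = {}
    i = 0
    while i < total_keys:
        start = time.time()
        for j in range(i, i + batch):
            d[str(j)] = j
        elapsed = time.time() - start
        i += batch
        print('%9d keys: %.2f s for the last %d inserts' % (i, elapsed, batch))

if __name__ == '__main__':
    insertion_profile()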
[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)
New submission from Dmitry Chichkov:

On large data sets (10-100 million keys) the default Python dictionary implementation fails to meet memory and performance constraints. It also apparently fails to keep O(1) complexity (after just 1M keys). As such, there is a need for a good, optimized, practical implementation of a high-performance container that can store such datasets. One of the alternatives is a regular Patricia trie data structure, which can meet the performance requirements on such datasets. (A toy pure-Python sketch of the idea follows this message.)

Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports the regular dict/counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled);
* in-memory library.

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia trie could be some state-of-the-art approach from modern research (e.g. http://www.aclweb.org/anthology/W/W09/W09-1505.pdf).

The best existing implementation I've been able to find so far is the one in BioPython. Compared to defaultdict(int) on the task of counting words, on a dataset of 123,981,712 words (6,504,484 unique), 1..21 characters long:
* Bio.trie - 459 MB / 0.13 hours, good O(1) behavior
* defaultdict(int) - 693 MB / 0.32 hours, poor, almost O(N) behavior

At 8,000,000 keys Python's defaultdict(int) starts showing almost O(N) behavior and gets unusable with 10,000,000+ unique keys.

A few usage/installation notes on the BioPython trie:

$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
6
>>> print trieobj.keys()
['hello']

More examples at: http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

--
components: Library (Lib)
messages: 112937
nosy: dmtr
priority: normal
severity: normal
status: open
title: Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)
type: performance
versions: Python 3.1, Python 3.2, Python 3.3

___ Python tracker <http://bugs.python.org/issue9520> ___
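Purely to illustrate the data structure being proposed (this is not one of the implementations listed above), here is a toy pure-Python radix/Patricia-style counter that shares key prefixes between entries. A practical version would be written in C and be far more compact; the class and method names are made up.

class RadixNode(object):
    """One node of a toy Patricia/radix trie mapping strings to counts."""
    __slots__ = ('children', 'count')

    def __init__(self):
        self.children = {}   # edge label (string) -> RadixNode
        self.count = 0       # how many times the key ending here was added

class RadixCounter(object):
    """Counter-like interface: add(key) increments, get(key) reads."""

    def __init__(self):
        self.root = RadixNode()

    def add(self, key):
        node = self.root
        while key:
            # find an existing edge sharing a prefix with the remaining key
            for label in list(node.children):
                prefix = self._common_prefix(label, key)
                if prefix:
                    child = node.children[label]
                    if prefix != label:
                        # split the edge: label == prefix + rest
                        middle = RadixNode()
                        middle.children[label[len(prefix):]] = child
                        del node.children[label]
                        node.children[prefix] = middle
                        child = middle
                    node, key = child, key[len(prefix):]
                    break
            else:
                # no shared prefix: add a fresh edge for the whole remainder
                child = RadixNode()
                node.children[key] = child
                node, key = child, ''
        node.count += 1

    def get(self, key):
        node = self.root
        while key:
            for label, child in node.children.items():
                if key.startswith(label):
                    node, key = child, key[len(label):]
                    break
            else:
                return 0
        return node.count

    @staticmethod
    def _common_prefix(a, b):
        i = 0
        while i < len(a) and i < len(b) and a[i] == b[i]:
            i += 1
        return a[:i]

# Usage: 'hello' and 'help' share the stored prefix 'hel'.
c = RadixCounter()
for w in ['hello', 'hello', 'help']:
    c.add(w)
print('%d %d %d' % (c.get('hello'), c.get('help'), c.get('hel')))   # 2 1 0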
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
Dmitry Chichkov added the comment:

Interestingly, in precisely these applications you often don't care about namespaces at all; frequently all you need is to extract 'text' or 'name' elements regardless of the namespace.

--

___ Python tracker <http://bugs.python.org/issue8583> ___
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
Dmitry Chichkov added the comment:

I agree that the argument name choice is poor, but that choice was already made by whoever coded the Expat parser that cElementTree.XMLParser wraps, so there is not much room here.

As to "the proposed feature has to be used with great care by users": this is already taken care of. The cElementTree.XMLParser class is a rather obscure one; as I understand it, it is only used by people who need high-performance XML parsing of large datasets (in the 10 GB - 10 TB range) in data-mining applications.

--

___ Python tracker <http://bugs.python.org/issue8583> ___
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
Dmitry Chichkov added the comment:

This patch does not modify the existing behavior of the library; the namespace_separator parameter is optional. The parameter already exists in the Expat library, but it is hard-coded in the cElementTree.XMLParser code.

Fredrik, yes, namespaces are a fundamental part of the XML information model. Yet the option of ignoring them is a very valuable one in performance-critical code. (A small Expat-level sketch follows this message.)

--

___ Python tracker <http://bugs.python.org/issue8583> ___
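For context, the same switch is already available at the raw Expat level, which is what the patch exposes through cElementTree.XMLParser. A small sketch (the sample document is made up): xml.parsers.expat.ParserCreate() takes namespace_separator directly, so element names arrive either as plain local tags or as "uri" + separator + "localname".

import xml.parsers.expat

SAMPLE = '<root xmlns="http://example.com/ns"><child>text</child></root>'

def parse(namespace_separator):
    names = []
    # namespace_separator=None disables namespace processing entirely;
    # a one-character string enables it and joins "uri" + sep + "localname".
    parser = xml.parsers.expat.ParserCreate(None, namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(SAMPLE, True)
    return names

print(parse('}'))    # ['http://example.com/ns}root', 'http://example.com/ns}child']
print(parse(None))   # ['root', 'child']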
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
Dmitry Chichkov added the comment:

And obviously iterparse can either be overridden in local user code or patched in the library. Here's the iterparse code/test code:

import cElementTree
from cStringIO import StringIO

class iterparse(object):
    root = None

    def __init__(self, file, events=None, namespace_separator="}"):
        if not hasattr(file, 'read'):
            file = open(file, 'rb')
        self._file = file
        self._events = events
        self._namespace_separator = namespace_separator

    def __iter__(self):
        events = []
        b = cElementTree.TreeBuilder()
        p = cElementTree.XMLParser(b, namespace_separator= \
            self._namespace_separator)
        p._setevents(events, self._events)
        while 1:
            data = self._file.read(16384)
            if not data:
                break
            p.feed(data)
            for event in events:
                yield event
            del events[:]
        root = p.close()
        for event in events:
            yield event
        self.root = root

x = """<root xmlns="http://www.very_long_url.com"><child>text</child></root>"""

context = iterparse(StringIO(x), events=("start", "end", "start-ns"))
for event, elem in context:
    print event, elem

context = iterparse(StringIO(x), events=("start", "end", "start-ns"),
                    namespace_separator=None)
for event, elem in context:
    print event, elem

It produces:

start-ns ('', 'http://www.very_long_url.com')
start <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element 'root' at 0x...>
start <Element 'child' at 0x...>
end <Element 'child' at 0x...>
end <Element 'root' at 0x...>

Note the absence of URIs and the ignored start-ns events in the namespace_separator=None version.

--

___ Python tracker <http://bugs.python.org/issue8583> ___
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
Changes by Dmitry Chichkov:

--
keywords: +patch
Added file: http://bugs.python.org/file17153/issue-8583.patch

___ Python tracker <http://bugs.python.org/issue8583> ___
[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser
New submission from Dmitry Chichkov:

The namespace_separator parameter is hard-coded in the cElementTree.XMLParser class, which makes it impossible to ignore XML namespaces with the cElementTree library. Here's a code example:

from xml.etree.cElementTree import iterparse
from StringIO import StringIO

xml = """<root xmlns="http://www.very_long_url.com"><child/></root>"""
for event, elem in iterparse(StringIO(xml)):
    print event, elem

It produces:

end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>

In the current implementation local tags get forcibly concatenated with URIs, often resulting in ugly code on the user's side and in performance degradation (at least due to the extra concatenations and the extra lengthy compare operations in the element-matching code).

Internally cElementTree uses the Expat parser, which does namespace processing only optionally, enabled by providing a value for the namespace_separator argument. This argument is hard-coded in cElementTree:

self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");

Attached is a patch exposing this parameter in the cElementTree.XMLParser() arguments. The parameter is optional and the default behavior is unchanged. Here's the test code:

import cElementTree

x = """<root xmlns="http://www.very_long_url.com"><child>text</child></root>"""

parser = cElementTree.XMLParser()
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator="}")
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator=None)
parser.feed(x)
elem = parser.close()
print elem

The resulting output:

<Element '{http://www.very_long_url.com}root' at 0xb7e885f0>
<Element '{http://www.very_long_url.com}root' at 0xb7e88608>
<Element 'root' at 0x...>

--
components: Library (Lib)
messages: 104671
nosy: dmtr
priority: normal
severity: normal
status: open
title: Hardcoded namespace_separator in the cElementTree.XMLParser
type: performance
versions: Python 2.5, Python 2.6, Python 2.7

___ Python tracker <http://bugs.python.org/issue8583> ___
[issue8228] pprint, single/multiple items per line parameter
Dmitry Chichkov added the comment:

Yes, this patch is nowhere near production level. Unfortunately it works for me, and at the moment I don't have time to improve it further. The current version doesn't check an item's width upfront, so there is definitely room for improvement.

There is also an issue with the included issue_5131.patch: the pretty-printed code cannot be executed unless the 'defaultdict(' ')' type-specs are removed.

--

___ Python tracker <http://bugs.python.org/issue8228> ___
[issue8228] pprint, single/multiple items per line parameter
Dmitry Chichkov added the comment:

A quick, dirty and utterly incorrect patch that works for me. It includes issue_5131.patch (defaultdict support, etc.) and targets trunk (2.6), revision 77310.

--
keywords: +patch
Added file: http://bugs.python.org/file16640/issue_8228.dirty.patch

___ Python tracker <http://bugs.python.org/issue8228> ___
[issue8228] pprint, single/multiple items per line parameter
New submission from Dmitry Chichkov:

I've run into a case where pprint isn't really pretty:

import pprint
pprint.PrettyPrinter().pprint([1]*100)

This prints a lengthy column of '1's, one item per line; not pretty at all. Look:

[1,
 1,
 1,
 1,
 1,
 ...
 1]

--
components: Library (Lib)
messages: 101672
nosy: Dmitry.Chichkov
severity: normal
status: open
title: pprint, single/multiple items per line parameter
type: feature request
versions: Python 2.6

___ Python tracker <http://bugs.python.org/issue8228> ___
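A sketch of the kind of output being requested, via a small standalone workaround; the helper name and behavior are made up for illustration, and it only handles flat lists of short items, not nested structures. (For what it's worth, pprint later grew a compact=True option in Python 3.4 that packs multiple items per line in a similar way.)

def pprint_compact(items, width=80, indent=1):
    """Print a list with as many items per line as fit within `width`.

    A rough approximation of the requested behavior for flat lists;
    real pprint also handles nesting, which this sketch ignores.
    """
    if not items:
        print('[]')
        return
    parts = [repr(x) for x in items]
    line = '['
    out = []
    for i, part in enumerate(parts):
        piece = part + (', ' if i < len(parts) - 1 else ']')
        # wrap before the piece that would overflow the requested width
        if len(line) + len(piece) > width and line.strip(' ['):
            out.append(line.rstrip())
            line = ' ' * indent
        line += piece
    out.append(line)
    print('\n'.join(out))

pprint_compact([1] * 100)
# [1, 1, 1, 1, 1, ..., 1,
#  1, 1, ...]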