[issue4356] Add "key" argument to "bisect" module functions

2015-03-06 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Use case: a custom immutable array with a large number of items and indirect 
key field access. For example a ctypes array, a memoryview, a ctypes pointer, 
or any other custom container.


1. I'm not sure how anyone can consider a precached key array a right and 
scalable answer. It is just a ridiculous idea: precaching a key array is an 
O(N) operation, while bisect is O(log N).

2. @Raymond There is a statement that "adding 'key()' would encourage bad 
design and steer people away from better solutions."  Well, right now, the 
design that is being encouraged results in O(N) code, because lazy developers 
are just 'pre-caching' (copying) the keys!

3. There is a statement that key() would have to be called multiple times per 
item. What? Why?

4. There is a statement that one can always add a __cmp__ function to an 
object. This is ridiculous: the object could be immutable. There should be no 
need to modify the object/array when all you need to do is bisect it.
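
For illustration, a minimal sketch (not from this thread; the KeyedView helper
and the Point layout are hypothetical) of how bisection by a derived key stays
O(log N) without copying the keys out into a separate list:

import bisect
import ctypes

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

class KeyedView(object):
    """Read-only sequence view that applies `key` lazily, item by item."""
    def __init__(self, seq, key):
        self._seq, self._key = seq, key
    def __len__(self):
        return len(self._seq)
    def __getitem__(self, i):
        return self._key(self._seq[i])

# An immutable-ish ctypes array sorted by the .x field; the keys are
# never copied into an O(N) side list.
points = (Point * 5)(Point(0.0, 9.0), Point(1.5, 8.0), Point(2.0, 7.0),
                     Point(4.5, 6.0), Point(9.0, 5.0))
view = KeyedView(points, key=lambda p: p.x)
print(bisect.bisect_left(view, 2.0))   # -> 2, with only O(log N) key calls

bisect works here because it only needs __getitem__ and __len__ on its target.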

--
nosy: +dmtr




[issue9520] Add Patricia Trie high performance container

2010-08-14 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Yes, it looks like you are right. And while there is some slight performance 
degradation, at least nothing drastic is happening up to 30M keys. Using your 
modified test:

    1000 words (     961 keys), 3609555 words/s, 19239926 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 3390980 words/s, 18180771 lookups/s, 87 bytes/key (0.0MB)
  100000 words (   83168 keys), 3298809 words/s, 12509481 lookups/s, 37 bytes/key (3.0MB)
 1000000 words (  755372 keys), 2477793 words/s,  7537963 lookups/s, 66 bytes/key (48.0MB)
 5000000 words ( 3501140 keys), 2291004 words/s,  6487727 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 2238081 words/s,  6216454 lookups/s, 59 bytes/key (384.0MB)
20000000 words (13061821 keys), 2175688 words/s,  5817085 lookups/s, 61 bytes/key (768.0MB)
50000000 words (31188460 keys), 2117751 words/s,  5137209 lookups/s, 51 bytes/key (1536.0MB)

It looks like internal memory estimates (sys.getsizeof()) are also pretty good:

Before: ['VmPeak:\t10199764 kB', 'VmSize:\t 5552416 kB']
50000000 words (31188460 keys), 2117751 words/s, 5137209 lookups/s, 51 bytes/key (1536.0MB)
After: ['VmPeak:\t10199764 kB', 'VmSize:\t 7125284 kB']

7125284 kB - 5552416 kB = 1572868 kB = 1536.00391 MB

--




[issue9520] Add Patricia Trie high performance container

2010-08-13 Thread Dmitry Chichkov

Changes by Dmitry Chichkov:


Added file: http://bugs.python.org/file18515/dc.dict.bench.0.02.py




[issue9520] Add Patricia Trie high performance container

2010-08-08 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Yes. Data containers optimized for very large datasets, compactness and strict 
adherence to O(1) can be beneficial.

Python has great high-performance containers, but there is a certain lack of 
compact ones. For example, on an x64 machine the following dict() mapping 
10,000,000 very short unicode keys (~7 chars) to integers eats 149 bytes per 
entry.
>>> import os, re
>>> d = dict()
>>> for i in xrange(0, 10000000): d[unicode(i)] = i
>>> print re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
['VmPeak:\t 1458324 kB', 'VmSize:\t 1458324 kB']
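
A quick worked check of the 149 bytes/entry figure from those numbers:

# VmSize is reported in kB; convert to bytes and divide by the key count.
print(1458324 * 1024 / 10000000.0)   # -> ~149.3 bytes per entry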

I can understand that there are all kinds of reasons why it is so, and even why 
it is good. But having an unobtrusive *compact* container would be nice 
(although you'd be most welcome to tweak the default containers so that they 
adjust to large datasets appropriately).

Also, I can't emphasize enough that compactness still matters sometimes. 
Modern-day datasets are getting larger and larger (literally terabytes), and 
the 'just add more memory' strategy is not always feasible.


Regarding the dict() violation of O(1): so far I have been unable to reproduce 
it in the test. I can certainly see it on the real dataset, and trust me, it 
was very annoying to see the ETA go down from 10 hours to 8 hours and then 
gradually climb back up to 17 hours and hang there. This was _solved_ by 
switching from dict() to Bio.trie(), so the problem certainly had something to 
do with dict(). I don't know what is causing it though.
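
A minimal probe of the kind that could expose this (a sketch, not the attached
benchmark; the batch size is an arbitrary assumption): time dict inserts in
fixed-size batches and watch whether the per-batch cost stays flat as the
table grows:

import time

d = {}
batch = 1000000
for step in range(10):
    t0 = time.time()
    for i in range(step * batch, (step + 1) * batch):
        d[str(i)] = i
    # With O(1) amortized inserts, each batch should take about the same time.
    print("%8d keys: %.2fs for the last %d inserts"
          % (len(d), time.time() - t0, batch))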

--




[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

No, I'm not simply running out of system memory: 8GB/x64/Linux, and in my test 
cases I've only seen ~25% of memory utilized. And good idea, I'll try to play 
with the cyclic garbage collector.

It is harder than I thought to make a solid synthetic test case for this 
issue. The trouble is that you need to be able to generate data (e.g. 
100,000,000 words / 5,000,000 unique) with a distribution close to the 
real-life scenario (e.g. the word lengths, frequencies and uniqueness of 
English text). If somebody has a good idea of how to do this nicely, you'd be 
very welcome.

My best shot so far is in the attachment.
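
For what it's worth, one way to approximate that skew (a sketch under assumed
parameters, not the attached file) is to draw word ranks from a Zipf-like
distribution, so a few words repeat very often while a long tail stays nearly
unique:

import bisect
import random
from collections import defaultdict

def zipf_words(n_words, n_unique, alpha=1.2):
    # Cumulative Zipf weights over a vocabulary of n_unique "words".
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_unique + 1)]
    total = sum(weights)
    cum, acc = [], 0.0
    for w in weights:
        acc += w / total
        cum.append(acc)
    for _ in range(n_words):
        rank = bisect.bisect_left(cum, random.random())
        yield "w%d" % min(rank, n_unique - 1)

counts = defaultdict(int)
for word in zipf_words(100000, 20000):
    counts[word] += 1
print(len(counts))   # unique words actually drawn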

--
Added file: http://bugs.python.org/file18406/dc.dict.bench.0.01.py




[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Thank you for your comment. Perhaps we should try to separate this into two 
issues:
1) Bug: Python's dict() is unusable on datasets with 10,000,000+ keys. Here I 
should provide a solid test case showing a deviation from O(1);

2) Feature request/idea: add a Patricia Trie high-performance container.

--




[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)

2010-08-04 Thread Dmitry Chichkov

New submission from Dmitry Chichkov:

On large data sets (10-100 million keys) the default Python dictionary 
implementation fails to meet memory and performance constraints. It also 
apparently fails to keep O(1) complexity (after just 1M keys). As such, there 
is a need for a good, optimized, practical implementation of a 
high-performance container that can store such datasets.

One of the alternatives is a regular Patricia Trie data structure. It can meet 
performance requirements on such datasets. Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;
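
To make the idea concrete, here is a minimal character-level trie with a
dict-like interface (an illustrative sketch, not any of the libraries below; a
real Patricia/radix trie additionally compresses single-child chains into edge
labels, which is where the memory savings come from):

class _Node(object):
    __slots__ = ("children", "value", "has_value")
    def __init__(self):
        self.children = {}
        self.value = None
        self.has_value = False

class Trie(object):
    def __init__(self):
        self._root = _Node()
    def __setitem__(self, key, value):
        node = self._root
        for ch in key:
            node = node.children.setdefault(ch, _Node())
        node.value, node.has_value = value, True
    def __getitem__(self, key):
        node = self._root
        for ch in key:
            if ch not in node.children:
                raise KeyError(key)
            node = node.children[ch]
        if not node.has_value:
            raise KeyError(key)
        return node.value

t = Trie()
t["hello"] = 5
t["hello"] += 1     # works because both __getitem__ and __setitem__ exist
print(t["hello"])   # -> 6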

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia Trie could be some state-of-the-art 
approach from modern research (e.g. 
http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ).


The best existing implementation I've been able to find so far is the one in 
BioPython. Compared to defaultdict(int) on the task of counting words, on a 
dataset of 123,981,712 words (6,504,484 unique), 1..21 characters long:
* Bio.trie - 459 MB/0.13 hours, good O(1) behavior
* defaultdict(int) - 693 MB/0.32 hours, poor, almost O(N) behavior

At 8,000,000 keys python's defaultdict(int) starts showing almost O(N) 
behavior and gets unusable with 10,000,000+ unique keys.



A few usage/installation notes on the BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
6
>>> print trieobj.keys()
['hello']

More examples at:
http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

--
components: Library (Lib)
messages: 112937
nosy: dmtr
priority: normal
severity: normal
status: open
title: Add Patricia Trie high performance container (python's defaultdict(int) 
is unusable on datasets with 10,000,000+ keys.)
type: performance
versions: Python 3.1, Python 3.2, Python 3.3




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-05-02 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Interestingly, in precisely these applications you often don't care about 
namespaces at all. Often all you need is to extract the 'text' or 'name' 
elements, regardless of the namespace.
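
The usual user-level workaround (a common idiom, not part of this patch) is to
strip the URI back off after parsing, paying the concatenation cost anyway:

def localname(tag):
    # Qualified tags look like '{uri}local'; plain tags pass through.
    return tag.rsplit('}', 1)[-1]

print(localname('{http://www.very_long_url.com}child'))   # -> child
print(localname('child'))                                  # -> child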

--




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-05-02 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

I agree that the argument name choice is poor. But it has already been made by 
whoever coded the EXPAT parser, which cElementTree.XMLParser wraps. So there 
is not much room here.

As to 'the proposed feature has to be used with great care by users' - this is 
already taken care of. If you look, the cElementTree.XMLParser class is a 
rather obscure one. As I understand it, it is only used by users requiring 
high-performance XML parsing of large datasets (10GB - 10TB range) in 
data-mining applications.

--




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-05-01 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

This patch does not modify the existing behavior of the library; the 
namespace_separator parameter is optional. The parameter already exists in the 
EXPAT library, but it is hard-coded in the cElementTree.XMLParser code.

Fredrik, yes, namespaces are a fundamental part of the XML information model. 
Yet the option of having them ignored is a very valuable one in 
performance-critical code.

--




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-04-30 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

And obviously iterparse can either be overridden in local user code or patched 
in the library. Here's the iterparse code/test code:

import cElementTree
from cStringIO import StringIO

class iterparse(object):
    root = None

    def __init__(self, file, events=None, namespace_separator="}"):
        if not hasattr(file, 'read'):
            file = open(file, 'rb')
        self._file = file
        self._events = events
        self._namespace_separator = namespace_separator

    def __iter__(self):
        events = []
        b = cElementTree.TreeBuilder()
        p = cElementTree.XMLParser(b, namespace_separator=self._namespace_separator)
        p._setevents(events, self._events)
        while 1:
            data = self._file.read(16384)
            if not data:
                break
            p.feed(data)
            for event in events:
                yield event
            del events[:]
        root = p.close()
        for event in events:
            yield event
        self.root = root

x = """http://www.very_long_url.com";>text"""
context = iterparse(StringIO(x), events=("start", "end", "start-ns"))
for event, elem in context: print event, elem

context = iterparse(StringIO(x), events=("start", "end", "start-ns"), 
namespace_separator = None)
for event, elem in context: print event, elem


It produces:
start-ns ('', 'http://www.very_long_url.com')
start <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
end <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
start <Element 'root' at 0x...>
start <Element 'child' at 0x...>
end <Element 'child' at 0x...>
end <Element 'root' at 0x...>

Note the absence of URIs and the ignored start-ns events in the 
namespace_separator=None version.

--




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-04-30 Thread Dmitry Chichkov

Changes by Dmitry Chichkov:


--
keywords: +patch
Added file: http://bugs.python.org/file17153/issue-8583.patch




[issue8583] Hardcoded namespace_separator in the cElementTree.XMLParser

2010-04-30 Thread Dmitry Chichkov

New submission from Dmitry Chichkov:

The namespace_separator parameter is hard-coded in the cElementTree.XMLParser 
class, disallowing the option of ignoring XML namespaces with the cElementTree 
library.

Here's the code example:
 from xml.etree.cElementTree import iterparse
 from StringIO import StringIO
 xml = """<root xmlns="http://www.very_long_url.com"><child/></root>"""
 for event, elem in iterparse(StringIO(xml)): print event, elem

It produces:
 end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
 end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>

In the current implementation local tags get forcibly concatenated with URIs, 
often resulting in ugly code on the user's side and in performance degradation 
(at least due to extra concatenations and extra lengthy compare operations in 
the element-matching code).

Internally cElementTree uses the EXPAT parser, which does namespace processing 
only optionally, enabled by providing a value for the namespace_separator 
argument. This argument is hard-coded in cElementTree:
 self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");

Well, attached is a patch exposing this parameter in the 
cElementTree.XMLParser() arguments. The parameter is optional and the default 
behavior is unchanged. Here's the test code:

import cElementTree

x = """http://www.very_long_url.com";>text"""

parser = cElementTree.XMLParser()
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator="}")
parser.feed(x)
elem = parser.close()
print elem

parser = cElementTree.XMLParser(namespace_separator=None)
parser.feed(x)
elem = parser.close()
print elem

The resulting output:
<Element '{http://www.very_long_url.com}root' at 0xb7e885f0>
<Element '{http://www.very_long_url.com}root' at 0xb7e88608>
<Element 'root' at 0x...>

--
components: Library (Lib)
messages: 104671
nosy: dmtr
priority: normal
severity: normal
status: open
title: Hardcoded namespace_separator in the cElementTree.XMLParser
type: performance
versions: Python 2.5, Python 2.6, Python 2.7




[issue8228] pprint, single/multiple items per line parameter

2010-04-01 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Yes, this patch is nowhere near production level. Unfortunately, it works for 
me, and at the moment I don't have time to improve it further. The current 
version doesn't check the item's width upfront, so there is definitely room 
for improvement.

There is also an issue with the included issue_5131.patch - pretty-printed 
output cannot be executed unless the defaultdict(...) type-specs are removed.
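
The round-trip problem is easy to demonstrate (known behavior, independent of
this patch): the repr of a defaultdict is not valid Python, so pretty-printed
output containing it cannot be exec'd back:

from collections import defaultdict

d = defaultdict(int)
d['a'] = 1
print(repr(d))   # defaultdict(<type 'int'>, {'a': 1}) -- the <type ...>
                 # part is not parseable ("<class 'int'>" on Python 3)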

--




[issue8228] pprint, single/multiple items per line parameter

2010-03-25 Thread Dmitry Chichkov

Dmitry Chichkov added the comment:

Quick, dirty and utterly incorrect patch that works for me. Includes 
issue_5131.patch (defaultdict support, etc). Targets trunk (2.6), revision 
77310.

--
keywords: +patch
Added file: http://bugs.python.org/file16640/issue_8228.dirty.patch




[issue8228] pprint, single/multiple items per line parameter

2010-03-24 Thread Dmitry Chichkov

New submission from Dmitry Chichkov:

I've run into a case where pprint isn't really pretty. 

import pprint
pprint.PrettyPrinter().pprint([1]*100)

Prints a lengthy column of '1's; not pretty at all. Look:

[1,
 1,
 1,
 1,
 ... (93 more lines of " 1,") ...
 1,
 1,
 1]
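
For contrast, a compact rendering is one stdlib call away (a sketch of the
desired look using textwrap, not the proposed pprint parameter):

import textwrap
print(textwrap.fill(repr([1] * 100), width=72))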

--
components: Library (Lib)
messages: 101672
nosy: Dmitry.Chichkov
severity: normal
status: open
title: pprint, single/multiple items per line parameter
type: feature request
versions: Python 2.6
