Hi Benoit,

My Python behaves differently (see below). If I understand correctly,
you are saying that a variable affects an operation inside a function
in a different scope, but I find that hard to imagine. How is it
possible, other than by consuming all available RAM and forcing
Python to swap?
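
The only mechanism I can think of (just a guess on my part) is
CPython's cyclic garbage collector: building a million tuples triggers
many collections, and a full collection has to traverse every
reference held by the 50M-element list. A quick way to test that
hypothesis would be something like this (illustrative sketch, Python 2):

import gc, time

def make_tuples(n):
    # emulate the SQL result: a list of one-element tuples
    return [(i,) for i in xrange(n)]

big = range(50000000)  # the large list suspected of causing the slowdown

t0 = time.time()
make_tuples(1000000)
print "gc enabled:  %.2f s" % (time.time() - t0)

gc.disable()  # suspend cyclic collection while the tuples are allocated
t0 = time.time()
make_tuples(1000000)
print "gc disabled: %.2f s" % (time.time() - t0)
gc.enable()

If the second timing stays fast, the slowdown is collection overhead
rather than swapping.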

Note: when I ran the first test, I noticed mysql running at full speed
even after I had closed my ipython session, and I had to restart the
mysql server. Could it be that zombie mysql connections consume almost
all of your RAM and force swapping after "i = range(50000000)"?
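
To check for lingering connections on your side, SHOW PROCESSLIST
should reveal them. A minimal sketch with MySQLdb (the host and
credentials below are placeholders, and the account needs the PROCESS
privilege to see other users' threads):

import MySQLdb

# List the server's current threads; connections left over from dead
# ipython sessions would show up here with long idle times.
conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
cur = conn.cursor()
cur.execute("SHOW PROCESSLIST")
for row in cur.fetchall():
    print row  # (Id, User, Host, db, Command, Time, State, Info)
cur.close()
conn.close()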


In [1]: def run_sql(no):
   ...:     return [(i,) for i in xrange(no)]
   ...:

In [2]: def run_sql2(no):
   ...:     return [(i,) for i in range(no)]
   ...:

In [4]: def run_sql3():
   ...:     return [(i,) for i in range(1000000)]
   ...:

In [6]: %time res = run_sql(1000000)
CPU times: user 0.11 s, sys: 0.11 s, total: 0.22 s
Wall time: 0.22 s

In [8]: %time res = run_sql2(1000000)
CPU times: user 0.21 s, sys: 0.13 s, total: 0.34 s
Wall time: 0.34 s

In [10]: %time res = run_sql3()
CPU times: user 0.28 s, sys: 0.01 s, total: 0.29 s
Wall time: 0.29 s

In [12]: a = range(50000000)

In [13]: %time res = run_sql(1000000)
CPU times: user 2.25 s, sys: 0.04 s, total: 2.29 s
Wall time: 2.29 s

In [15]: %time res = run_sql2(1000000)
CPU times: user 0.28 s, sys: 0.00 s, total: 0.28 s
Wall time: 0.28 s

In [17]: %time res = run_sql3()
CPU times: user 0.27 s, sys: 0.00 s, total: 0.27 s
Wall time: 0.27 s

In [19]: %timeit res=run_sql(1000000)
10 loops, best of 3: 135 ms per loop

In [20]: %timeit res=run_sql2(1000000)
10 loops, best of 3: 176 ms per loop

In [21]: %timeit res=run_sql3()
10 loops, best of 3: 175 ms per loop
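
By the way, the only difference between run_sql and run_sql2 above is
xrange vs range: xrange produces its values lazily, while range
materialises the whole intermediate list first. The footprint
difference is easy to see (illustrative, Python 2; exact numbers
depend on the build):

import sys

n = 1000000
lst = range(n)     # a real list: ~8 MB of pointers alone on 64-bit,
                   # plus the int objects themselves
lazy = xrange(n)   # a constant-size object; items are computed on demand
print sys.getsizeof(lst)   # roughly 8000072 on a 64-bit build
print sys.getsizeof(lazy)  # roughly 40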



On Fri, Dec 23, 2011 at 12:01 AM, Benoit Thiell <bthi...@cfa.harvard.edu> wrote:
> Hi Roman.
>
> Thanks for your test. Here are the latest news:
>
> I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
> and got very similar results. This makes me think that the problem is
> not specific to a Python MySQL bridge but lies somewhere lower level.
> So, on Alberto's advice, I ran the following test
> which emulates a SQL request and returns a list of 1M tuples
> containing incrementing integers:
>
>    In [1]: def run_sql():
>       ...:     return [(i,) for i in range(1000000)]
>
>    In [2]: %time res = run_sql()
>    CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
>    Wall time: 0.99 s
>
>    In [3]: i = range(50000000)
>
>    In [4]: %time res = run_sql()
>    CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
>    Wall time: 10.37 s
>
>    In [5]: i = range(50000000)
>
>    In [6]: %time res = run_sql()
>    CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
>    Wall time: 11.41 s
>
> You can see that the creation of a data structure similar to the one
> returned by run_sql() takes significantly longer when a big list is
> already in memory. I don't quite understand why the third "SQL
> request" takes not much longer than the second.
>
> This is a very serious problem for us because it means that any SQL
> request will take a long time to return just because we have two
> gigantic citation dictionaries loaded in memory. So either we find a
> way to prevent this problem, or we find a way to avoid loading the
> dictionaries into memory.
>
> Benoit.
>
> On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>> Hi,
>> I tried the same - I could not reproduce it with 1M rows, but with
>> 10M I could. It started eating too much memory and swapping. The
>> objects are not reclaimed (neither running the garbage collector nor
>> resetting the objects helps, which could mean the objects live
>> outside of Python, in the C extension - however, after some time it
>> got somewhat better; the 'somewhat' isn't very precise, I know :)). I
>> think it might be a mysqldb bug.
>>
>> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
>> Wall time: 12.55 s
>>
>> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
>> Wall time: 11.88 s
>>
>> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
>> Wall time: 12.52 s
>>
>> In [17]: len(res)
>> Out[17]: 10000000
>>
>> In [18]: i = range(50000000)
>>
>> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
>> Wall time: 71.10 s
>>
>> In [21]: import gc
>>
>> In [22]: gc.collect()
>> Out[22]: 3
>>
>> In [23]: i = []
>>
>> In [24]: gc.collect()
>> Out[24]: 3
>>
>> In [25]: gc.isenabled()
>> Out[25]: True
>>
>> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
>> Wall time: 56.63 s
>>
>> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
>> Wall time: 24.57 s
>>
>> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
>> Wall time: 15.50 s
>>
>> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
>> Wall time: 12.17 s
>>
>> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
>> Wall time: 14.50 s
>>
>> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
>> Wall time: 15.92 s
>>
>> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
>> Wall time: 12.05 s
>>
>> In [40]: len(i)
>> Out[40]: 0
>>
>> In [41]: i = range(50000000)
>>
>> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
>> Wall time: 78.04 s
>>
>> In [44]: i = []
>>
>> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
>> Wall time: 52.45 s
>>
>> In [47]: gc.collect()
>> Out[47]: 3
>>
>> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
>> Wall time: 57.72 s
>>
>>
>> roman
>>
>> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <bthi...@cfa.harvard.edu> wrote:
>>> Hi Jorban.
>>>
>>> Thanks for taking the time to advise on this issue. The query I used
>>> is not the one that caused the problem originally; I just chose it
>>> because it returns a large number of results. So I can't really
>>> change anything on this side.
>>>
>>> Also, the query runs fine by itself, without the dictionaries loaded
>>> in memory. So I think it would be worthwhile to get a sense of how
>>> much big data structures loaded in memory slow down large SQL queries.
>>>
>>> Benoit.
>>>
>>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <ferran.jo...@uab.cat> wrote:
>>>> Hello Benoit,
>>>>
>>>> [...]
>>>>>     In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>     CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>>     Wall time: 2.30 s
>>>>>
>>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>>> quite a big problem for us as our citation dictionaries are so big.
>>>>
>>>> I have noticed in more than one case that for some minimally complex
>>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>>> move part of the manipulation from one to the other you may be
>>>> surprised. I cannot remember the exact case, but the equivalent in
>>>> your case would be changing:
>>>>
>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>
>>>> to:
>>>>
>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>>  res = res[:1000000]
>>>>
>>>> I remember gains of 10x.  YMMV, but you can try it.
>>>>
>>>> Ferran
>>>
>>>
>>>
>>> --
>>> Benoit Thiell
>>> The SAO/NASA Astrophysics Data System
>>> http://adswww.harvard.edu/
>
>
>
> --
> Benoit Thiell
> The SAO/NASA Astrophysics Data System
> http://adswww.harvard.edu/
