Hi Benoit,

My Python behaves differently (see below). If I understand correctly, you are saying that a variable affects the performance of a function in a different scope, but I find that hard to imagine. How is that possible, other than by consuming all available RAM and forcing Python to swap?
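The only in-process mechanism I can imagine, besides swapping, is CPython's cyclic garbage collector: building the 1M result tuples triggers periodic full collections, and each full collection traverses every element of any big list that is still alive, whether or not run_sql can see it. That is just a hypothesis; below is a minimal sketch to test it (CPython 2.x assumed; the timed() helper is mine, not from your test):

import gc
import time

def run_sql(no):
    # stand-in for the real SQL fetch: build `no` one-element tuples
    return [(i,) for i in xrange(no)]

def timed(f, *args):
    # crude wall-clock timer, enough to show a factor-of-ten effect
    t0 = time.time()
    f(*args)
    return time.time() - t0

print "baseline:          %.2f s" % timed(run_sql, 1000000)

big = range(50000000)   # keep ~50M ints alive, as in your test

print "with big list:     %.2f s" % timed(run_sql, 1000000)

# if full GC passes traversing `big` are the culprit, switching
# the collector off should bring the time back to the baseline
gc.disable()
try:
    print "with gc disabled:  %.2f s" % timed(run_sql, 1000000)
finally:
    gc.enable()

If the gc-disabled run matches the baseline, the slowdown is collector overhead rather than swap, and wrapping the allocation-heavy code in gc.disable()/gc.enable() would be a cheap workaround.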
Note: when I tried the first test, I noticed mysql running at full speed even after I closed my ipython session; I had to restart the mysql server. Could it be that zombie mysql connections consume almost all of your RAM and force swapping after "i = range(50000000)"?

In [1]: def run_sql(no):
   ...:     return [(i,) for i in xrange(no)]
   ...:

In [2]: def run_sql2(no):
   ...:     return [(i,) for i in range(no)]
   ...:

In [4]: def run_sql3():
   ...:     return [(i,) for i in range(1000000)]
   ...:

In [6]: %time res = run_sql(1000000)
CPU times: user 0.11 s, sys: 0.11 s, total: 0.22 s
Wall time: 0.22 s

In [8]: %time res = run_sql2(1000000)
CPU times: user 0.21 s, sys: 0.13 s, total: 0.34 s
Wall time: 0.34 s

In [10]: %time res = run_sql3()
CPU times: user 0.28 s, sys: 0.01 s, total: 0.29 s
Wall time: 0.29 s

In [12]: a = range(50000000)

In [13]: %time res = run_sql(1000000)
CPU times: user 2.25 s, sys: 0.04 s, total: 2.29 s
Wall time: 2.29 s

In [15]: %time res = run_sql2(1000000)
CPU times: user 0.28 s, sys: 0.00 s, total: 0.28 s
Wall time: 0.28 s

In [17]: %time res = run_sql3()
CPU times: user 0.27 s, sys: 0.00 s, total: 0.27 s
Wall time: 0.27 s

In [19]: %timeit res = run_sql(1000000)
10 loops, best of 3: 135 ms per loop

In [20]: %timeit res = run_sql2(1000000)
10 loops, best of 3: 176 ms per loop

In [21]: %timeit res = run_sql3()
10 loops, best of 3: 175 ms per loop

On Fri, Dec 23, 2011 at 12:01 AM, Benoit Thiell <bthi...@cfa.harvard.edu> wrote:
> Hi Roman.
>
> Thanks for your test. Here are the latest news:
>
> I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
> and got very similar results. This makes me think that the problem is
> not specific to one Python MySQL bridge but maybe something lower
> level. So, on advice from Alberto, I ran the following test, which
> emulates a SQL request and returns a list of 1M tuples containing
> incrementing integers:
>
> In [1]: def run_sql():
>    ...:     return [(i,) for i in range(1000000)]
>
> In [2]: %time res = run_sql()
> CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
> Wall time: 0.99 s
>
> In [3]: i = range(50000000)
>
> In [4]: %time res = run_sql()
> CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
> Wall time: 10.37 s
>
> In [5]: i = range(50000000)
>
> In [6]: %time res = run_sql()
> CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
> Wall time: 11.41 s
>
> You can see that creating a data structure similar to the one
> returned by run_sql() takes significantly longer when a big list is
> already in memory. I don't quite understand why the third "SQL
> request" doesn't take much longer than the second.
>
> This is a very serious problem for us because it means that any SQL
> request will take a long time to return, just because we have two
> gigantic citation dictionaries loaded in memory. So either we find a
> way to prevent this problem, or we need to find a way to not load the
> dictionaries into memory.
>
> Benoit.
>
> On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>> Hi,
>> I tried the same: I could not reproduce it with 1M rows, but with 10M
>> I could. It started eating too much memory and swap. The objects are
>> not reclaimed (neither the garbage collector nor resetting the
>> objects helps, which could mean the objects live outside of Python,
>> in the C extension; however, after some time it got somewhat better;
>> the 'somewhat' isn't very precise, I know :)). I think it might be a
>> mysqldb bug.
>>
>> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
>> Wall time: 12.55 s
>>
>> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
>> Wall time: 11.88 s
>>
>> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
>> Wall time: 12.52 s
>>
>> In [17]: len(res)
>> Out[17]: 10000000
>>
>> In [18]: i = range(50000000)
>>
>> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
>> Wall time: 71.10 s
>>
>> In [21]: import gc
>>
>> In [22]: gc.collect()
>> Out[22]: 3
>>
>> In [23]: i = []
>>
>> In [24]: gc.collect()
>> Out[24]: 3
>>
>> In [25]: gc.isenabled()
>> Out[25]: True
>>
>> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
>> Wall time: 56.63 s
>>
>> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
>> Wall time: 24.57 s
>>
>> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
>> Wall time: 15.50 s
>>
>> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
>> Wall time: 12.17 s
>>
>> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
>> Wall time: 14.50 s
>>
>> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
>> Wall time: 15.92 s
>>
>> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
>> Wall time: 12.05 s
>>
>> In [40]: len(i)
>> Out[40]: 0
>>
>> In [41]: i = range(50000000)
>>
>> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
>> Wall time: 78.04 s
>>
>> In [44]: i = []
>>
>> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
>> Wall time: 52.45 s
>>
>> In [47]: gc.collect()
>> Out[47]: 3
>>
>> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
>> Wall time: 57.72 s
>>
>> roman
>>
>> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <bthi...@cfa.harvard.edu> wrote:
>>> Hi Ferran.
>>>
>>> Thanks for taking the time to advise on this issue. The query I used
>>> is not the one that caused the problem originally; I just chose it
>>> because it returns a large number of results. So I can't really
>>> change anything on this side.
>>>
>>> Also, the query runs fine by itself, without the dictionaries loaded
>>> in memory. So I think it would be useful to get a sense of how much
>>> big data structures loaded in memory impact large SQL queries.
>>>
>>> Benoit.
>>>
>>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <ferran.jo...@uab.cat> wrote:
>>>> Hello Benoit,
>>>>
>>>> [...]
>>>>> In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x
>>>>> LIMIT 1000000")
>>>>> CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>> Wall time: 2.30 s
>>>>>
>>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>>> quite a big problem for us as our citation dictionaries are so big.
>>>>
>>>> I have noticed in more than one case that for some minimally complex
>>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>>> move part of the manipulation from one to the other you can be
>>>> surprised. I cannot remember the exact case, but the equivalent for
>>>> yours would be changing:
>>>>
>>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>
>>>> to:
>>>>
>>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>> res = res[:1000000]
>>>>
>>>> I remember gains of 10x. YMMV, but you can try it.
>>>>
>>>> Ferran
>>>
>>> --
>>> Benoit Thiell
>>> The SAO/NASA Astrophysics Data System
>>> http://adswww.harvard.edu/
>
> --
> Benoit Thiell
> The SAO/NASA Astrophysics Data System
> http://adswww.harvard.edu/
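P.S. Regarding Ferran's LIMIT-versus-slice suggestion quoted above, this is roughly how I would time the two variants side by side. An untested sketch: the table name comes from Benoit's query and MySQLdb is the bridge discussed earlier in the thread, but the connection parameters are placeholders:

import time
import MySQLdb

# placeholder credentials; adjust to the local setup
conn = MySQLdb.connect(host="localhost", user="test",
                       passwd="", db="test")

def fetch(query):
    # run one query and return all rows as a tuple of tuples
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    cur.close()
    return rows

t0 = time.time()
res = fetch("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
print "LIMIT in SQL:    %.2f s, %d rows" % (time.time() - t0, len(res))

t0 = time.time()
res = fetch("SELECT id_bibrec FROM bibrec_bib03x")[:1000000]
print "slice in Python: %.2f s, %d rows" % (time.time() - t0, len(res))

Note that the second variant fetches the whole table into memory before slicing, so it only makes sense when the table is not much larger than the slice you keep.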