On Sat, Nov 3, 2012 at 1:02 AM, Alejandro Pulver <[email protected]> wrote:
> On 11/02/2012 07:36 PM, Alejandro Pulver wrote:
>> On 11/02/2012 06:40 PM, Taavi Burns wrote:
>>> I see that you're reading from the compressed zip file directly. That makes 
>>> me suspect that your map/reduce is waiting for data from the 
>>> single-CPU-bound job of zip decompression.
>>>
>>> Try decompressing the archive first, and make sure all the files fit into 
>>> your OS' disk cache (or flush the cache between tests).
>> Sorry for the confusion. I've also been testing that version (which
>> actually runs a minute faster for CPython), but the times in my previous
>> mail are from "test1" which reads from disk files.
>>
>> Also note that the zip is a 300MB file I created from the extracted
>> files, not the 30MB 7z which would probably take too long to extract on
>> the fly.
>>
> Well, now that I mention it, there is something strange in these results
> as well (using "test2", the version which reads from a ZIP archive):
>
> $ time python test_mapreduce.py
> 170686
> python test_mapreduce.py  1869.19s user 11.44s system 357% cpu 8:46.44 total
>
> $ time ~/Downloads/pypy-1.9/bin/pypy test_mapreduce.py
> 170685
> ~/Downloads/pypy-1.9/bin/pypy test_mapreduce.py  889.64s user 15.32s system 182% cpu 8:17.20 total
>
> So CPython seems to run faster without consuming more CPU (which is
> strange since it's decompressing), and PyPy is taking about twice as
> long as before.
> In an earlier version, I used a global variable for opening the zip, and
> used it from "func_map"; CPython worked the same, but PyPy consumed all
> my RAM and ran faster (instead of slower like the previous result shows).
>
> BTW, the result is different between CPython and PyPy (counts one word
> less). This might point to a bug.
>
> Regards,
> Alejandro
> _______________________________________________
> pypy-dev mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/pypy-dev

The one thing I can say is that without looking at your algorithm it's
impossible to tell what is going on.

PyPy will spend more time pickling and unpickling (it's slower at that
than CPython) but might be way faster at the actual processing. This
might lead to different time reports (as the message transport time
will be higher).
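One rough way to check how much of the wall time goes into pickling is to time the round trip separately from the processing. This is only a sketch, assuming Python 3 and a payload similar to what your map step sends back; `time_pickle_roundtrip` is a made-up helper name, not part of multiprocessing:

```python
import pickle
import time


def time_pickle_roundtrip(obj, repeats=5):
    """Return (average seconds per dumps+loads round trip, restored object)."""
    restored = None
    start = time.perf_counter()
    for _ in range(repeats):
        # This is roughly what happens each time a result crosses a process boundary.
        restored = pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
    elapsed = time.perf_counter() - start
    return elapsed / repeats, restored
```

Running this under both interpreters on the same payload should show whether the transport cost differs enough to explain the timings.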

For what it's worth, maybe you should stop using multiprocessing (it's a
giant hack) and use explicit socket-based communication? I suggest
using something like twisted or execnet. You'll end up with a cleaner
model and likely with a faster solution.
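To give an idea of what explicit socket-based communication looks like, here is a minimal sketch using only the standard `socket` module rather than twisted or execnet (both of which handle this for you). The helper names and the 4-byte length prefix are my own choices for illustration:

```python
import socket
import struct


def send_msg(sock, payload):
    """Send one message: 4-byte big-endian length, then the payload bytes."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Receive one message framed by send_msg()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The point is that you decide exactly what goes over the wire (for example, only the final word counts), instead of having everything pickled implicitly.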

Since the data is mostly read-only, you can also just run completely
separate processes that mmap the same data.
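A sketch of the mmap idea, assuming the input is one plain file and each worker process is handed a byte range to count words in. `split_points` and `count_words` are hypothetical helpers, not taken from your script:

```python
import mmap
from collections import Counter


def split_points(path, parts):
    """Return byte offsets cutting the file into `parts` slices on newline boundaries."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        offsets = [0]
        for i in range(1, parts):
            guess = size * i // parts
            newline = mm.find(b"\n", guess)
            cut = size if newline == -1 else newline + 1
            # Never move backwards, so slices stay non-overlapping.
            offsets.append(max(cut, offsets[-1]))
        offsets.append(size)
        return offsets


def count_words(path, start, end):
    """Count whitespace-separated words in bytes [start, end) of a read-only mapping."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return Counter(mm[start:end].split())
```

Each process would call `count_words` on its own range and send back only its counter. Since every process maps the same file read-only, the OS can share the underlying pages, so the input itself is never pickled or copied between processes.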

Cheers,
fijal