[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386780280 Sounds great, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386779158 On my personal computer I do indeed see the same speed-up of mxnet over numpy. On the other machines, the results I quoted above still stand. I guess in the end this depends a lot on the particular system and the build options of the libraries, though that is strange given your explanation of the implementation. As this code runs only once per epoch to shuffle the dataset, I believe it does not matter much whether it takes 200ms or 500ms for large datasets. It was just unbearable that it took 10s+ before. I don't feel strongly about changing it, though I won't propose such a change myself given that I had mixed results depending on the computer. If you open a PR and someone is willing to merge it, I won't mind.
[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386742567 @asitstands I guess the difference between our experiments is that I used an optimized numpy from conda and the standard [mxnet pypi build](https://pypi.org/pypi/mxnet). Using both an optimized numpy and an optimized mxnet build on an AWS p3 instance, I do observe, like you, that mxnet is faster for small sizes (4): ~500μs vs ~800μs for numpy. For large sizes (12525568), however, the `asnumpy()` overhead is large and the numpy version takes just 180ms compared to 600ms with the mxnet code.
[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386511294

@asitstands the above timings were taken with an array size of 12525568. I have tried again with size 4 and get the following:

```
In [3]: %timeit list(sample_np(4))
2.08 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit list(sample_mx(4))
3.12 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Are you taking the overhead of converting to scalars into account?
[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386136121

Thanks @asitstands. Comparing the numpy code to the mx.nd code you provided gives the following performance on my machine:

```
In [3]: %timeit list(sample_mx(1529*8192))
2.17 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit list(sample_np(1529*8192))
1.3 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

So relying on mx.nd.random.shuffle + asnumpy seems to add roughly an extra second.

Regarding the RNG, our test cases set both the numpy and the mxnet seeds. I believe other parts of mxnet also use numpy random, so it may be good to document that both seeds must be set to get deterministic behavior. If this is the only place where numpy.random is used, it may be worth the extra second to stay consistent?
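To illustrate the point about seeding, here is a minimal sketch of a helper that seeds both generators. The name `seed_all` is hypothetical and not part of mxnet or of this PR; the sketch assumes `mx.random.seed` is the mxnet-side call and skips it when mxnet is not installed.

```python
import numpy as np


def seed_all(seed):
    """Seed numpy and, if it is installed, mxnet (hypothetical helper)."""
    np.random.seed(seed)
    try:
        import mxnet as mx
    except ImportError:
        return  # mxnet is not available; only numpy was seeded
    mx.random.seed(seed)


# Re-seeding with the same value reproduces the same numpy permutation.
seed_all(0)
first = np.random.permutation(10)
seed_all(0)
second = np.random.permutation(10)
print(bool((first == second).all()))
```

With only one of the two seeds set, any code path that draws from the other generator would still vary between runs.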
[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler
leezu commented on issue #10768: Use numpy in RandomSampler URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-385826952

It doesn't perform well. At least not with the naive approach of:

```
import mxnet as mx

def sample(length):
    indices = mx.nd.arange(length)
    # shuffle returns a new array, so keep the result
    indices = mx.nd.random.shuffle(indices)
    # one asscalar() call (and one synchronization) per element
    return (indices[i].asscalar() for i in range(indices.shape[0]))
```

What did you have in mind?
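For contrast with the naive generator above, here is a minimal sketch of a numpy-based sampler that shuffles an index array on the host. The name `sample_np` mirrors the one used in the timings elsewhere in this thread, but its body here is an assumption, not the code that was actually benchmarked.

```python
import numpy as np


def sample_np(length):
    """Return an iterator over 0..length-1 in random order (assumed sketch)."""
    indices = np.arange(length)
    # Shuffle in place on the host; no per-element conversion from an NDArray.
    np.random.shuffle(indices)
    return iter(indices)


order = list(sample_np(8))
# Every index appears exactly once, whatever the order.
print(sorted(int(i) for i in order))  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```

The difference from the naive version is that the shuffle and the iteration both happen on a single host array, so no `asscalar()` call is needed per index.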