[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-04 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386780280
 
 
   Sounds great, thanks!
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-04 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386779158
 
 
   On my personal computer I do indeed see the same speed-up of mxnet
compared to numpy. On the other machines, the results I quoted above still
stand. I guess in the end this depends a lot on the particular system and the
build options of the libraries, though that is strange given your explanation
of the implementation. As this code only runs once per epoch to shuffle the
dataset, I believe it is not that important whether it takes 200ms or 500ms
for large datasets. It was just unbearable that it took 10s+ before.
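
Since the timings apparently differ from machine to machine, a small harness can help check the numbers locally. The following is only a sketch, not the benchmark used in this thread; `shuffle_np` and `best_time` are hypothetical names, and the array size is an arbitrary choice for illustration.

```python
import timeit

import numpy as np


def shuffle_np(length):
    """Return a shuffled index array of the given length using numpy."""
    indices = np.arange(length)
    np.random.shuffle(indices)  # shuffles in place
    return indices


def best_time(fn, length, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: fn(length), number=1, repeat=repeat))


# Arbitrary size for illustration; the thread used 12525568 for "large".
elapsed = best_time(shuffle_np, 100_000)
print("numpy shuffle: %.3f ms" % (elapsed * 1e3))
```

An mxnet variant can be passed to `best_time` in the same way, which keeps the comparison on equal footing for a given machine and build.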
   
   I don't feel strongly about changing it, though I won't propose such a
change myself given that I had mixed results depending on the computer. If you
open a PR and someone is willing to merge it, I won't mind.




[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-04 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386742567
 
 
   @asitstands I guess the difference between our experiments is that I used an
optimized numpy from conda and the standard [mxnet pypi
build](https://pypi.org/pypi/mxnet).
   
   Using both an optimized numpy and an optimized mxnet build on an AWS p3
instance, I do observe, like you, that mxnet is faster for small sizes (4):
~500μs vs ~800μs for numpy. For large sizes (12525568), however, the
`asnumpy()` overhead is large and the numpy version takes just 180ms compared
to 600ms with the mxnet code.
   




[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-03 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386511294
 
 
   @asitstands the above timings were taken with an array size of 12525568. I
have tried again with size 4 and get the following:
   ```
   
   In [3]: %timeit list(sample_np(4))
   2.08 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   
   
   In [5]: %timeit list(sample_mx(4))
   3.12 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   
   ```
   
   Are you taking the overhead of converting to scalars into account?
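
The `sample_np` and `sample_mx` helpers timed above are not shown in this thread. A minimal sketch of what they might look like, assuming each one shuffles `range(length)` and returns an iterator over the result (the function bodies here are an assumption, not the code that was actually benchmarked):

```python
import numpy as np


def sample_np(length):
    """Iterate over a random permutation of range(length), shuffled by numpy."""
    indices = np.arange(length)
    np.random.shuffle(indices)  # shuffles in place
    return iter(indices)


def sample_mx(length):
    """Same idea with mxnet: shuffle an NDArray, then copy to numpy once."""
    import mxnet as mx  # imported lazily so sample_np works without mxnet

    # mx.nd.arange defaults to float32, so cast back to integers after copying.
    shuffled = mx.nd.random.shuffle(mx.nd.arange(length))
    return iter(shuffled.asnumpy().astype(np.int64))


print(list(sample_np(4)))
```

Wrapping the call in `list(...)`, as in the timings above, includes the cost of handing each index back to Python, which is part of what a data loader would pay in practice.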




[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-02 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-386136121
 
 
   Thanks @asitstands. Comparing the numpy code to the mx.nd code you provided
results in the following performance on my machine:
   ```
   
   In [3]: %timeit list(sample_mx(1529*8192))
   2.17 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [4]: %timeit list(sample_np(1529*8192))
   1.3 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   ```
   
   So relying on mx.nd.random.shuffle + asnumpy seems to add an extra second.
   
   Regarding RNG, our test cases set both the numpy and mxnet seeds. I believe
other parts of mxnet also use numpy random, so it may be good to document that
both seeds must be set to get deterministic behavior. If this turns out to be
the only place numpy.random is used, it may be worth the extra second to stay
consistent?
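
To illustrate the point about seeding, here is a minimal sketch of a helper that sets both seeds together. The name `seed_all` is hypothetical, and tolerating a missing mxnet install is an assumption made only so the sketch also runs in a numpy-only environment.

```python
import numpy as np


def seed_all(seed):
    """Seed numpy's global RNG and, if mxnet is importable, mxnet's RNG too."""
    np.random.seed(seed)
    try:
        import mxnet as mx
    except ImportError:
        return  # assumption for this sketch: allow running without mxnet
    mx.random.seed(seed)


# Re-seeding before each draw reproduces the same permutation.
seed_all(42)
first = np.random.permutation(8)
seed_all(42)
second = np.random.permutation(8)
print((first == second).all())
```

If a sampler draws from `numpy.random` while the rest of a training script draws from mxnet's generator, seeding only one of them would leave the shuffle order non-reproducible.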




[GitHub] leezu commented on issue #10768: Use numpy in RandomSampler

2018-05-01 Thread GitBox
leezu commented on issue #10768: Use numpy in RandomSampler
URL: https://github.com/apache/incubator-mxnet/pull/10768#issuecomment-385826952
 
 
   It doesn't perform well. At least not with the naive approach of:
   
   ```
   import mxnet as mx

   def sample(length):
       indices = mx.nd.arange(length)
       # shuffle returns a new array rather than modifying its input
       indices = mx.nd.random.shuffle(indices)
       return (indices[i].asscalar() for i in range(indices.shape[0]))
   
   ```
   
   What did you have in mind?

