GitHub user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-152893856

Did you try HDFS? I am assuming we'd get similar speed-ups there too, but in that case there are usually far fewer files, so the cost of setting up the streams is paid only a handful of times. What I am wondering is whether we'd ever actually have to deal with that many files in the non-S3 case.

This adds an additional cost for HDFS or any other FS, no? In those cases the number of files would usually be pretty small, so this change may end up being more expensive there. If it adds only a small cost, or if it becomes faster, then let's keep this.

On Sunday, November 1, 2015, Burak Yavuz <notificati...@github.com> wrote:

> @harishreedharan <https://github.com/harishreedharan> Here are some
> benchmark results:
> For reference, the driver was a r3.2xlarge EC2 instance.
>
> [image: image]
> <https://cloud.githubusercontent.com/assets/5243515/10871515/54c14846-809e-11e5-91e6-2ac3605d98b7.png>
>
> Num Threads    Rate (ms / file)    Speed-up
> 50             5.556101934         9.004997951
> 25             5.99898194          8.340196225
> 8              8.692144733         5.756080699
> 4              14.1162362          3.544336169
> 1              50.03268653         1
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/9373#issuecomment-152867985>.

Thanks,
Hari