[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

harishreedharan Sun, 01 Nov 2015 18:29:47 -0800

Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-152893856
  
    Did you try HDFS? I am assuming we'd get similar speedups there too, but in
    that case there are far fewer files, so the cost of setting up the
    streams is paid only a handful of times.
    
    What I am wondering is whether we'd actually ever have to deal with that
    many files in the non-S3 case. This adds an additional cost for HDFS or any
    other FS, no? In those cases the number of files would usually be pretty
    small, so the parallel path may end up being more expensive there.
    
    If this adds only a small cost or if it becomes faster, then let's keep
    this.
    
    
    On Sunday, November 1, 2015, Burak Yavuz <notificati...@github.com> wrote:
    
    > @harishreedharan <https://github.com/harishreedharan> Here are some
    > benchmark results:
    > For reference, the driver was a r3.2xlarge EC2 instance.
    >
    > [image: image]
    > <https://cloud.githubusercontent.com/assets/5243515/10871515/54c14846-809e-11e5-91e6-2ac3605d98b7.png>
    > Num Threads   Rate (ms / file)   Speed-up
    > 50            5.556101934        9.004997951
    > 25            5.99898194         8.340196225
    > 8             8.692144733        5.756080699
    > 4             14.1162362         3.544336169
    > 1             50.03268653        1
    >
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/9373#issuecomment-152867985>.
    >
    
    
    -- 
    
    Thanks,
    Hari





