On Apr 14 2021, Grunthos <philip.john.war...@gmail.com> wrote:
> On Wednesday, April 14, 2021 at 5:36:30 PM UTC+10 niko...@rath.org wrote:
>
>>
>> Yes, all of these would be possible and probably be faster. I think 
>> option (2) would be the best one. 
>>
>> Pull requests are welcome :-). 
>>
>>
> I had a funny feeling that might be the answer...and in terms of utility 
> and design, ISTM that " add a special s3ql command to do a 'tree copy' -- 
> it would know exactly which blocks it needed and download them en-masse 
> while restoring files (and would need a lot of cache, possibly even a 
> temporary cache drive)" is a good plan.
>
> I am not at all sure I am up for the (probable) deep-dive required, but if 
> I were to look at this could you give some suggested starting points? My 
> very naive approach (not knowing the internals at all) would be to build a 
> list of all required blocks, do some kind of topo sort, then start multiple 
> download threads. As each block was downloaded, determine if a new file can 
> be copied yet, and if so, copy it, then release any blocks that are no 
> longer needed.
>
> ...like I said, naive, and highly dependent on internals...and maybe 
> should use some kind of private mount to avoid horror.

I think there's a simpler solution.

1. Add a new special xattr to trigger the functionality (look at
s3qlcp.py and copy_tree() in fs.py) 

2. Have fs.py write directly to the destination directory (which should
be outside the S3QL mountpoint)

3. Start a number of async workers (no need for threads) that, in a
loop, download blocks and write them to a given offset in a given fh.

4. Have the main thread recursively traverse the source and issue "copy"
requests to the workers (through a queue)

5. Wait for all workers to finish.

6. Profit.
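Steps 3-5 could be sketched roughly like this with asyncio. This is only an illustration of the worker/queue scheme, not S3QL code: download_block() is a placeholder for the real backend fetch, and the "traversal" is just a prepared list of (offset, block_id) jobs.

```python
import asyncio
import os


async def download_block(block_id: int) -> bytes:
    # Placeholder for the real S3QL backend download (an assumption,
    # not the actual API). Returns a dummy 16-byte payload.
    await asyncio.sleep(0)
    return b"x" * 16


async def worker(queue: asyncio.Queue) -> None:
    # Each worker pulls (fh, offset, block_id) jobs in a loop, downloads
    # the block, and writes it at the given offset in the given fh.
    while True:
        job = await queue.get()
        if job is None:  # sentinel: no more work
            queue.task_done()
            break
        fh, offset, block_id = job
        data = await download_block(block_id)
        os.pwrite(fh, data, offset)
        queue.task_done()


async def copy_tree(jobs, dest_fh, n_workers: int = 4) -> None:
    # Main task stands in for the recursive source traversal: it issues
    # "copy" requests to the workers through a bounded queue, then sends
    # one sentinel per worker and waits for all of them to finish.
    queue: asyncio.Queue = asyncio.Queue(maxsize=2 * n_workers)
    workers = [asyncio.create_task(worker(queue)) for _ in range(n_workers)]
    for offset, block_id in jobs:
        await queue.put((dest_fh, offset, block_id))
    for _ in workers:
        await queue.put(None)
    await asyncio.gather(*workers)
```

Since the workers use os.pwrite() with explicit offsets, blocks of one file can land out of order (and blocks of different files can interleave) without any coordination beyond the queue.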


I wouldn't even bother putting blocks in the cache - just download and
write to the destination on the fly. Though it may be worth checking
whether a block is *already* in the cache and, if so, skipping the
download.


With this implementation, blocks referenced by multiple files will be
downloaded multiple times. I think this can be improved upon once the
minimum functionality is working.
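One possible later refinement along those lines would be to memoize in-flight downloads per block id, so a block shared by several files is fetched only once and every waiter gets the same bytes. A minimal sketch (fetch() is a stand-in for the backend call, not S3QL API):

```python
import asyncio


class BlockFetcher:
    # Hypothetical de-duplicating fetcher: the first caller for a block id
    # starts the download; concurrent callers await the same task instead
    # of downloading the block again.
    def __init__(self, fetch):
        self._fetch = fetch
        self._pending = {}  # block_id -> asyncio.Task

    async def get(self, block_id: int) -> bytes:
        task = self._pending.get(block_id)
        if task is None:
            task = asyncio.create_task(self._fetch(block_id))
            self._pending[block_id] = task
        return await task
```

Keeping completed tasks in the dict trades memory for the guarantee that a block is never fetched twice; evicting entries once all referencing files are written would be the obvious next step.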


Best,
-Nikolaus


-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«
