Re: [racket-users] Cleanest way to locate contiguous sequences? (as part of reuniting segments of a file)

2016-12-02 Thread David Storrs
Hi Jon,

That sounds excellent.  Always preferable to find a better algorithm, and I
wasn't familiar with interval trees.

Thanks for the pointer.

Dave

On Fri, Dec 2, 2016 at 3:18 PM, Jon Zeppieri  wrote:

> You could use an interval tree instead of a list. Before inserting into
> it, you could check to see if your current chunk number is contiguous with
> an existing interval in the tree. If so, merge the chunks and expand the
> existing interval. Otherwise, insert an interval of size 1 into the tree.
>
> Stephen Chang's ftree package might be useful here.
>
> - Jon

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [racket-users] Cleanest way to locate contiguous sequences? (as part of reuniting segments of a file)

2016-12-02 Thread Jon Zeppieri
You could use an interval tree instead of a list. Before inserting into it,
you could check to see if your current chunk number is contiguous with an
existing interval in the tree. If so, merge the chunks and expand the
existing interval. Otherwise, insert an interval of size 1 into the tree.

Stephen Chang's ftree package might be useful here.
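
A minimal sketch of the merge-on-insert idea described above. It does not use the ftree package: a plain sorted list of (lo . hi) pairs stands in for the interval tree, and the name add-chunk is hypothetical. Insertion here is linear in the number of intervals, whereas a real tree would be logarithmic.

```racket
#lang racket

;; Intervals are (lo . hi) pairs of chunk numbers, kept sorted and
;; non-adjacent. A plain list stands in for an interval tree here.

;; add-chunk : (listof (cons integer integer)) integer -> same
;; Hypothetical helper: merge chunk number n into the interval list.
(define (add-chunk intervals n)
  (cond
    [(null? intervals) (list (cons n n))]          ; nothing yet: size-1 interval
    [else
     (define lo (car (car intervals)))
     (define hi (cdr (car intervals)))
     (define more (cdr intervals))
     (cond
       [(< n (sub1 lo)) (cons (cons n n) intervals)] ; gap before: new size-1 interval
       [(= n (sub1 lo)) (cons (cons n hi) more)]     ; extends this interval downward
       [(<= lo n hi) intervals]                      ; already covered
       [(= n (add1 hi))                              ; extends upward, may bridge to next
        (if (and (pair? more) (= (car (car more)) (add1 n)))
            (cons (cons lo (cdr (car more))) (cdr more))
            (cons (cons lo n) more))]
       [else (cons (car intervals) (add-chunk more n))])]))

(define have
  (for/fold ([acc '()]) ([n '(1 2 3 5 7 200 201 202 203)])
    (add-chunk acc n)))
have               ; '((1 . 3) (5 . 5) (7 . 7) (200 . 203))
(add-chunk have 4) ; '((1 . 5) (7 . 7) (200 . 203))
```

Each interval corresponds to one superchunk on disk, so a change to an interval's bounds is the cue to append or merge files.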

- Jon






Re: [racket-users] Cleanest way to locate contiguous sequences? (as part of reuniting segments of a file)

2016-12-02 Thread David Storrs
Thanks, Dan.  Working with shifted lists is an interesting technique that
I'll keep in my quiver.  I appreciate you taking the time.




Re: [racket-users] Cleanest way to locate contiguous sequences? (as part of reuniting segments of a file)

2016-12-02 Thread Daniel Prager
Perhaps this is a more elegant approach to the tactical problem:
simultaneously iterating over displaced versions of the original list. E.g.

_x '(1 1 2 3   5   7 200 201 202) ; shifted right
x  '(1 2 3 5   7 200 201 202 203) ; original list
x_ '(2 3 5 7 200 201 202 203 203) ; shifted left

When (= (+ _x 1) x (- x_ 1)) we are mid-run. Otherwise we're at the start
or end of a run. Code below.

A further displacement, x__, is needed for runs of four or more elements:
it lets the loop skip interior elements so that each run collapses to a
single '-.


#lang racket

; Given a strictly ascending list of numbers, summarise runs
; (... a b b+1 ... b+n c ...) -> (... a b - b+n c ...)
; Assumes xs has at least two elements, because of (drop xs 2)
(define (runify xs)
  (define MIN (first xs))
  (define MAX (last xs))
  (for/list ([_x (cons MIN xs)]
             [x xs]
             [x_ (append (rest xs) (list MAX))]
             [x__ (append (drop xs 2) (list MAX MAX))]
             #:unless (= (+ _x 2) x_ (- x__ 1)))
    (if (= (+ _x 1) x (- x_ 1))
        '-
        x)))

; Replace runify-ed lists with lists of lists
; E.g. '(1 - 3 8 12 15 - 100) -> '((1 3) (8) (12) (15 100))
;
(define (chunkify xs [acc null])
  (cond [(null? xs) acc]
        [(equal? (first xs) '-)
         (chunkify (drop xs 2)
                   (cons (list (caar acc) (second xs))
                         (rest acc)))]
        [else
         (chunkify (rest xs) (cons (list (first xs)) acc))]))


(define xs '(1 2 3 5 7 200 201 202 203))

(runify xs) ; '(1 - 3 5 7 200 - 203)

(reverse (chunkify (runify xs))) ; '((1 3) (5) (7) (200 203))


* * *

Racket question: How do I define a sequence generator (in-shifted-list xs
shift) to make runify work in true single pass fashion?

I would like to be able to write the following version of runify, without
generating auxiliary lists:

(define (runify-efficient xs)
  (for/list ([_x (in-shifted-list xs -1)]
             [x xs]
             [x_ (in-shifted-list xs 1)]
             [x__ (in-shifted-list xs 2)]
             #:unless (= (+ _x 2) x_ (- x__ 1)))
    (if (= (+ _x 1) x (- x_ 1))
        '-
        x)))
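
One possible way to get such a sequence, as a sketch rather than a definitive answer: in-shifted-list is not a Racket built-in, so it is defined here from in-sequences, in-cycle, in-value and sequence-map. The padding convention (repeat the first element for negative shifts, the last element for positive shifts) is assumed from the explicit shifted lists shown earlier in this message. No shifted copy of the list is built, but these are generic sequences, so the loop probably runs slower than one over plain lists.

```racket
#lang racket

;; in-shifted-list : list integer -> sequence
;; Hypothetical helper (not a Racket built-in). Yields xs displaced by
;; `shift` positions, padding with the first element (negative shift) or
;; the last element (non-negative shift). For non-negative shifts the
;; padding is infinite, so iterate it in parallel with a finite sequence.
(define (in-shifted-list xs shift)
  (cond
    [(null? xs) '()]
    [(negative? shift)
     (define lo (first xs))
     (in-sequences (sequence-map (lambda (_i) lo) (in-range (- shift)))
                   (in-list xs))]
    [else
     (define hi (last xs))
     ;; list-tail shares structure with xs, so no new list is allocated
     (in-sequences (in-list (list-tail xs (min shift (length xs))))
                   (in-cycle (in-value hi)))]))

(define (runify-efficient xs)
  (for/list ([_x (in-shifted-list xs -1)]
             [x xs]
             [x_ (in-shifted-list xs 1)]
             [x__ (in-shifted-list xs 2)]
             #:unless (= (+ _x 2) x_ (- x__ 1)))
    (if (= (+ _x 1) x (- x_ 1))
        '-
        x)))

(runify-efficient '(1 2 3 5 7 200 201 202 203)) ; '(1 - 3 5 7 200 - 203)
```

Because the parallel for/list stops at the shortest sequence (the plain xs clause), the infinite padding is harmless, and one-element or empty inputs no longer raise an error from drop.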

Dan



[racket-users] Cleanest way to locate contiguous sequences? (as part of reuniting segments of a file)

2016-12-02 Thread David Storrs
This is more of a business-logic question than the syntax/technique questions
that usually show up on this list, but hopefully folks won't mind.

I started off to ask "what's the best way to find contiguous sequences of
numbers in a list?" and then realized that maybe I should answer the "What
are you trying to achieve?" question first.

Short version:
Tactical problem:  Given a list of sorted numbers, how best to identify
contiguous sequences within the list?
Strategic problem:   How best to put a chunked-up file back together given
a (possibly incomplete) set of chunks and the index number of those chunks?

Long version:
I have a file that's been broken up into chunks of roughly standard size
and I want to put the file back together.  I have the chunk number for each
chunk so I know what order they should be concatenated in.  I may not have
all the chunks, but I don't want to wait for all of them to arrive before
starting the assembly, so I'll need to be able to generate
intermediate products and add to them later when the assembly process is
re-run.

My database has filepath and chunk-number data for all chunks (even the
ones that haven't arrived yet; the filepaths are predictable), so I know
where the chunks will be and what order to assemble them in.

My first thought is to get the sorted list of chunk numbers for all the
chunks that I currently have, locate contiguous sub-sequences, and assemble
each of those into a superchunk, then merge superchunks with each other
once they become contiguous.  Something like this:

Available chunk nums:  '(1 2 3 5 7 200 201 202 203)

Group them by contiguous:  '( (1 2 3) (5) (7) (200 201 202 203))

Generate superchunks:  1-3, 200-203

Later...

Available chunk nums:  '(1-3 4 5 7 190 193 200-203)

Group them by contiguous:  '( (1-3 4 5) (7) (190) (193) (200-203))

Merge 4 and 5 into 1-3, rename 1-3 as 1-5.

...etc
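
The merge step in the walkthrough above could be sketched like this, assuming each superchunk and each lone chunk is represented as a (lo . hi) pair (a lone chunk n being (n . n)) and that the list is already sorted. The name coalesce and the pair representation are illustrative choices, not something from the original setup.

```racket
#lang racket

;; A range is a (lo . hi) pair; a lone chunk n is written (n . n).
;; coalesce : sorted (listof range) -> (listof range)
;; Hypothetical helper: merge every range that starts right after the
;; previous one ends.
(define (coalesce ranges)
  (reverse
   (for/fold ([acc '()])
             ([r (in-list ranges)])
     (cond
       ;; r continues the most recent range: widen that range's upper bound
       [(and (pair? acc) (= (car r) (add1 (cdr (car acc)))))
        (cons (cons (car (car acc)) (cdr r)) (cdr acc))]
       ;; gap (or first range): start a new entry
       [else (cons r acc)]))))

(coalesce '((1 . 3) (4 . 4) (5 . 5) (7 . 7) (190 . 190) (193 . 193) (200 . 203)))
;; '((1 . 5) (7 . 7) (190 . 190) (193 . 193) (200 . 203))
```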

I realized I don't actually know a simple way to identify contiguous
sub-sequences.  I came up with the following, but it feels clumsy.

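; nums is the sorted, non-empty list of chunk numbers
(define result '())   ; completed runs, most recent first

(define-values (n final)
  (for/fold ((prev (car nums))
             (acc (list (car nums))))   ; seed with the first number so it is not dropped
            ((n (cdr nums)))
    (values n
            (if (= n (add1 prev))
                (cons n acc)
                (begin
                  (set! result (cons (reverse acc) result))
                  (list n))))))
(cons (reverse final) result)

Given this:  '(1 2 3 5 7 200 201 202 203)
It yields this: '((200 201 202 203) (7) (5) (1 2 3))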

Which is fine -- I don't care about the order of the sublists within the
overall list, as long as each sublist is itself sorted ascending.
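
For comparison, here is a sketch of the same grouping without set!, carrying both the finished runs and the current run as for/fold accumulators. The name group-runs is made up for illustration. It assumes the input is sorted ascending, and it returns the runs in ascending order.

```racket
#lang racket

;; group-runs : sorted (listof integer) -> (listof (listof integer))
;; Hypothetical helper: split nums into maximal runs of consecutive integers.
(define (group-runs nums)
  (define-values (done last-run)
    (for/fold ([runs '()]       ; finished runs, most recent first
               [current '()])   ; run being built, most recent number first
              ([n (in-list nums)])
      (cond
        [(null? current) (values runs (list n))]        ; very first number
        [(= n (add1 (car current)))                     ; continues the run
         (values runs (cons n current))]
        [else                                           ; gap: close the run
         (values (cons (reverse current) runs) (list n))])))
  (if (null? last-run)
      '()
      (reverse (cons (reverse last-run) done))))

(group-runs '(1 2 3 5 7 200 201 202 203))
;; '((1 2 3) (5) (7) (200 201 202 203))
```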

Does anyone have any advice to offer, on either the tactical or strategic
problem?
