hi all,

here are some design and coding notes for my vision of a new core loop
for ack. i divided it up into three parts: reading lines, matching and
tracking, and printing results. i don't go over all the possible options
as that is too detailed for this email. we will discuss them as we hack
the code tomorrow.

the current ack uses a <$fh> to read each line from the input. if you
want raw speed in perl i/o you have to bypass its i/o system (whether it
is perlio or the older stdio). i have done this in 2 cpan modules noted
for their speed, File::Slurp and File::ReadBackwards. the latter has
code that is very close to what we would want for the ack read line. it
needs to be modified to only read forward (rip out all the seek stuff)
but the guts are close to usable as is. the other issue is what api we
use inside the loop. a simple call to a readline is slow as it means a
sub call per line. it doesn't matter how that sub is written, the call
overhead adds up. so a better way is to manage a buffer of lines and get
a line from it in the core loop if we can. otherwise we call the sub to
refill the buffer with lines. then the sub calls will be negligible but
we need to code the condition to get a line quickly.

here is a sample code version of the buffer/readline stuff:

        refill_lines_buffer( $fh ) if @lines <= 2 ;
        my $line = shift @lines ;
        last unless defined $line ;

the reason for <= 2 is that if you have one 'line' left in the buffer it
could be a partial line with the rest of it in the next block to be
read. refill_lines_buffer will read blocks and split lines until it
finds more than one line or eof (then the last line is full even if no
trailing \n).

here is the code from File::ReadBackwards that does the read-buffer work:

# see if there is more than 1 line in the buffer

                if ( @{$lines_ref} > 1 ) {

# we have a complete line so return it
# and convert those damned cr/lf lines to \n

                        $lines_ref->[-1] =~ s/\015\012/\n/
                                        if $self->{'is_crlf'} ;

                        return( pop @{$lines_ref} ) ;
                }

# we don't have a complete line, so we have to read blocks until we do

                my $seek_pos = $self->{'seek_pos'} ;

# see if we are at the beginning of the file

                if ( $seek_pos == 0 ) {

# the last read never made more lines, so return the last line in the buffer
# if no lines left then undef will be returned
# and convert those damned cr/lf lines to \n

                        $lines_ref->[-1] =~ s/\015\012/\n/
                                        if @{$lines_ref} && $self->{'is_crlf'} ;

                        return( pop @{$lines_ref} ) ;
                }

# we have to read more text so get the handle and the current read size

                my $handle = $self->{'handle'} ;
                my $read_size = $self->{'read_size'} ;

# after the first read, always read the maximum size

                $self->{'read_size'} = $max_read_size ;

# seek to the beginning of this block and save the new seek position

                $seek_pos -= $read_size ;
                sysseek( $handle, $seek_pos, SEEK_SET ) ;
                $self->{'seek_pos'} = $seek_pos ;

# read in the next (previous) block of text

                my $read_cnt = sysread( $handle, $read_buf, $read_size ) ;

# prepend the read buffer to the leftover (possibly partial) line

                my $text = $read_buf ;
                $text .= shift @{$lines_ref} if @{$lines_ref} ;

# split the buffer into a list of lines
# this may want to be $/ but reading files backwards assumes plain text and
# newline separators

                @{$lines_ref} = ( $self->{'sep_is_regex'} ) ?
                        $text =~ /(.*?$self->{'rec_sep'}|.+)/gs :
                        $text =~ /(.*?\Q$self->{'rec_sep'}\E|.+)/gs ;

#print "Lines \n=>", join( "<=\n=>", @{$lines_ref} ), "<=\n" ;

        }

much of that can be removed (the backwards seeking stuff). and you can
change the partial line merge stuff to go forward (append the new block
to the leftover partial line instead of prepending it).

also the separator stuff can be simplified assuming only text files will
be acked. but that can be handled later.
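
to make the forward version concrete, here is a rough sketch of how
refill_lines_buffer could look once it only reads forward. the variable
names, the block size and the dropped crlf/separator handling are all
just my placeholder choices, not a final design:

        my @lines ;
        my $read_size = 8192 ;

        sub refill_lines_buffer {

                my( $fh ) = @_ ;

# read blocks until we have more than one line or we hit eof

                while ( @lines <= 1 ) {

                        my $read_cnt = sysread( $fh, my $read_buf, $read_size ) ;

# at eof the last buffered line is complete even without a trailing \n

                        return unless $read_cnt ;

# append the new block to the leftover (possibly partial) last line

                        my $text = @lines ? pop( @lines ) . $read_buf : $read_buf ;

# split the text back into lines, keeping the newlines

                        push @lines, $text =~ /(.*?\n|.+)/gs ;
                }

                return ;
        }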


-------------------------

the main loop of ack has to track two main types of data, the lines
being read and buffered for context (the range of lines before/after a
matched line) and the info about the matched lines (i call them hits)
themselves. ack does this with a convoluted set of data and slow copying
of the full lines. my idea is to use two shift registers (arrays with
push and shift) to hold the lines and the hit info. another important
trick is to only store references to the lines and move those
around. that will save tons of ram and copying costs. also there is a
need to handle prefixing lines and marking (highlighting) them where
hits are found.

the line context buffer will hold a ref to each line as it comes in
and shift off lines when its count is greater than the context
size. this is a shift register with a max size of the context size.
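
in code that is nothing but a push and a bounded shift, something like
this (the names and the hardcoded context size are only placeholders
for illustration):

        my @line_buf ;
        my $context_size = 2 ;                  # e.g. ack -B 2
        my $line = "the line just read in\n" ;  # from the read loop

# push a ref to the new line and drop the oldest one when the buffer
# grows past the context size

        push @line_buf, \$line ;
        shift @line_buf while @line_buf > $context_size ;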

the hit buffer will have a hit structure (hash ref) pushed onto it when
the current line matches the regex. and for each line coming in, the
first element in the hit buffer will be checked to see if it is ready to
print (the 'after' number of context lines have been read in). the hit
contains stuff like the line number of the hit, the marked up line
(highlight the match, etc.), the special hit prefix (the context lines
get simpler prefixes), etc.
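
to make the shape concrete, pushing a hit could look roughly like
this. the field and variable names are my guesses and not settled:

        my @hits ;
        my $regex       = qr/foo/ ;             # the search pattern
        my $after_count = 3 ;                   # e.g. ack -A 3
        my $hit_prefix  = '42:' ;               # whatever the options call for
        my $line_num    = 42 ;
        my $line        = "a line with foo in it\n" ;

# push a hit record onto the hit buffer when the current line matches

        if ( $line =~ /$regex/ ) {

                push @hits, {
                        'line_num'   => $line_num,      # line number of the hit
                        'line_ref'   => \$line,         # ref to the line (markup added at print time)
                        'prefix'     => $hit_prefix,    # special prefix for hit lines
                        'after_left' => $after_count,   # 'after' context lines still to read
                } ;
        }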

now that i mentioned prefixes, this is a good time to cover them a
bit. when a context is printed, each context line is marked with some
form of prefix designated by command line options. this can be line
numbers, file names, etc. a null prefix of a blank or : is possible i
think. the matched line gets its own special prefix. my idea is to store
context line prefixes with the line buffer. i don't want to prepend the
prefixes as that will entail more copying of line text. instead the line
buffer can hold a list of lists with each list having the prefix (or a
ref to it?) and the ref to the line.
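
so an entry in the line buffer would be a small anonymous array,
roughly like this (again, the names are made up for illustration):

        my @line_buf ;
        my $prefix = '40-' ;                    # e.g. line number plus the context separator
        my $line   = "a plain context line\n" ;

# store the prefix and a ref to the line, never a copy of the line text

        push @line_buf, [ $prefix, \$line ] ;

# printing an entry later just dereferences the pair

        my( $pfx, $line_ref ) = @{ $line_buf[0] } ;
        print $pfx, ${$line_ref} ;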

how the hit structure tracks the after context count to determine when
to print needs to be designed. it has several possible solutions and i
haven't decided on one. we can talk about this at the meeting and pick
one. it is easily changed if we don't like it.
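
just so we have something concrete to argue about, here is a sketch of
one of those possible solutions. it assumes the hit buffer and the
'after_left' field from the sketch above, plus a print_context sub
which is covered in the next section:

# one option: each new line read decrements the oldest hit's
# 'after_left' count. when it hits zero that hit has all of its after
# context buffered and can go out. a final flush at eof is still
# needed for hits near the end of the file.

        my @hits ;              # the hit buffer from the sketch above

        sub check_oldest_hit {

                return unless @hits ;

                print_context( shift @hits )
                        if --$hits[0]{'after_left'} <= 0 ;

                return ;
        }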

there are more options to deal with but that covers many of them and the
rest can be cleanly bolted on to this set of structures. there are open
detail issues and some possible design changes but this is a pretty good
start for a fast and clean core loop.

------------------------

printing a context

this happens when a matched line (a hit) has been found and enough
'after' context lines have been read in. so this involves printing the
line buffer but with one major difference: the hit line must be printed
with its own special prefix and with optional markup. one idea is to
store those special hit strings in the hit structure. then replace the
LOL entry for the hit line in the line buffer (saving the original
context entry for reuse). now you can print the entire line buffer
(dereferencing as needed) in one print call.
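
here is a sketch of that swap and print. it assumes @line_buf holds the
[ prefix, line ref ] pairs from before and that the hit record knows
which slot in the buffer holds its line (the 'buf_index' field is my
invention to keep the sketch short):

        sub print_context {

                my( $hit ) = @_ ;

                my $i = $hit->{'buf_index'} ;

# save the plain context entry and swap in the hit's own prefix and
# marked up line

                my $saved_entry = $line_buf[$i] ;
                $line_buf[$i] = [ $hit->{'prefix'}, $hit->{'line_ref'} ] ;

# print the whole context in one call, dereferencing each pair

                print map { $_->[0], ${ $_->[1] } } @line_buf ;

# put the original entry back so the line buffer can keep rolling

                $line_buf[$i] = $saved_entry ;

                return ;
        }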

-----------------

hopefully the design and notes will allow us to redo the core loop and
get most of it working. it is broken down into three parts and we can
split up the work (including merging, testing, etc) into several
teams. i expect to be sticking my large schnozz into each team. :)

thanx,

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
