hi all, here are some design and coding notes for my vision of a new core loop for ack. i divided it up into three parts, reading lines, matching and tracking, and printing results. i don't go over all the possible options as that is too detailed for this email. we will discuss them as we hack the code tomorrow.
the current ack uses a <$fs> to read each line from the input. if you want raw speed in perl i/o you have to bypass its i/o system (whether it is perlio or the older stdio. i have done this in 2 cpan modules noted for their speed, File::Slurp and File::ReadBackwards. the latter has code that is very close to what we would want for the ack read line. it needs to be modified to only read forward (rip out all the seek stuff) but the guts are close to usable as is. the other issue is what api do we use inside the loop. a simple call to a readline is slow as it means a sub call per line. it doesn't matter how that sub is written, the call overhead adds up. so a better way is to manage a buffer of lines and get a line from it in the core loop if we can. else we call the sub to refill the buffer with lines. then the sub calls will be negligible but we need to code the conditon to get a line quickly. here is a sample code version of the buffer/readline stuff: refill_lines_buffer( $fh ) if @lines <= 2 ; my $line = shift @lines ; last unless defined $line ; the reason for <= 2 is that if you have one 'line' left in the buffer it could be a partial line with the rest of it in the next block to be read. refill_lines_buffer will read blocks and split lines until it finds more than one line or eof (then the last line is full even if no trailing \n). here is the code from File::ReadBackwards that is the readbuffer stuff: # see if there is more than 1 line in the buffer if ( @{$lines_ref} > 1 ) { # we have a complete line so return it # and convert those damned cr/lf lines to \n $lines_ref->[-1] =~ s/\015\012/\n/ if $self->{'is_crlf'} ; return( pop @{$lines_ref} ) ; } # we don't have a complete, so have to read blocks until we do my $seek_pos = $self->{'seek_pos'} ; # see if we are at the beginning of the file if ( $seek_pos == 0 ) { # the last read never made more lines, so return the last line in the buffer # if no lines left then undef will be returned # and convert those damned cr/lf lines to \n $lines_ref->[-1] =~ s/\015\012/\n/ if @{$lines_ref} && $self->{'is_crlf'} ; return( pop @{$lines_ref} ) ; } # we have to read more text so get the handle and the current read size my $handle = $self->{'handle'} ; my $read_size = $self->{'read_size'} ; # after the first read, always read the maximum size $self->{'read_size'} = $max_read_size ; # seek to the beginning of this block and save the new seek position $seek_pos -= $read_size ; sysseek( $handle, $seek_pos, SEEK_SET ) ; $self->{'seek_pos'} = $seek_pos ; # read in the next (previous) block of text my $read_cnt = sysread( $handle, $read_buf, $read_size ) ; # prepend the read buffer to the leftover (possibly partial) line my $text = $read_buf ; $text .= shift @{$lines_ref} if @{$lines_ref} ; # split the buffer into a list of lines # this may want to be $/ but reading files backwards assumes plain text and # newline separators @{$lines_ref} = ( $self->{'sep_is_regex'} ) ? $text =~ /(.*?$self->{'rec_sep'}|.+)/gs : $text =~ /(.*?\Q$self->{'rec_sep'}\E|.+)/gs ; #print "Lines \n=>", join( "<=\n=>", @{$lines_ref} ), "<=\n" ; } much of that can be removed (backwards seeking stuff). and you can change the partial line merge stuff to go forward (append to $lines_ref[0] and not prepend!) also the separator stuff can be simplified assuming only text files will be acked. but that can be handled later. ------------------------- the main loop of ack has to track two main types of data, the lines being read and buffered for context (the range of lines before/after a matched line) and the info about the matched lines (i call them hits) themselves. ack does this with a convoluted set of data and slow copying of the full lines. my idea is to use two shift registers (arrays with push and shift) to hold the lines and the hit info. another important trick is to only store references to the lines and move those around. that will save tons of ram and copying costs. also there is a need to handle prefixing and marking lines (highlighting) them where hits are found. the line context buffer will hold each a ref to each line as it comes in and shifts off lines when its count is greater than the context size. this is a shift register with a max size of the context size. the hit buffer will have a hit structure (hash ref) pushed onto it when the current line matches the regex. and for each line coming in, the first element in the hit buffer will be checked to see if it is ready to print (the 'after' number context lines have been read in). the hit contains stuff like the line number of the hit, the marked up line (highlight the match, etc.), the special hit prefix (the context lines get simpler prefixes), etc. now that i mentioned prefixes, this is a good time to cover them a bit. when a context is printed, each context line is marked with some form of prefix designated by command line options. this can be line numbers, file names, etc. a null prefix of a blank or : is possible i think. the matched line gets its own special prefix. my idea is to store context line prefixes with the line buffer. i don't want to prepend the prefixes as that will entail more copying of line text. instead the line buffer can hold a list of lists with each list having the prefix (or a ref to it?) and the ref to the line. how the hit structure tracks the after context count to determine when to print needs to be designed. it has several possible solutions and i haven't decided on one. we can talk about this at the meeting and pick one. it is easily changed if we don't like it. there are more options to deal with but that covers many of them and the rest can be cleanly bolted on to this set of structures. there are open detail issues and some possible design changes but this is a pretty good start for a fast and clean core loop. ------------------------ printing a context this happens when a matched line (a hit) has been found and enough 'after' context lines have been read in. so this involves printing the line buffer but with one major difference. the hit line must be printed with a different prefix and the hit line printed with optional markup. one idea is to make those special hit strings in the hit structure. then replace the LOL for the hit line in the line buffer (save the original context line for reuse). now you can print the entire line buffer (dereferencing as needed) in one print call. ----------------- hopefully the design and notes will allow us to redo the core loop and get most of it working. it is broken down into three parts and we can split up the work (including merging, testing, etc) into several teams. i expect to be sticking my large schnozz into each team. :) thanx, uri -- Uri Guttman ------ [EMAIL PROTECTED] -------- http://www.sysarch.com -- ----- Perl Code Review , Architecture, Development, Training, Support ------ --------- Free Perl Training --- http://perlhunter.com/college.html --------- --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com --------- _______________________________________________ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm