reallocating a buffer is not an option, it should be dynamic
On Sunday, May 8, 2022 at 1:26:34 PM UTC-7 Const V wrote:
> Using r.ReadLine() I can successfully read 100 MB line in a string, using
> the following conditional statement which is
> increasing the buffer until '\n' is encountered.
> for isPrefix && err == nil {
> line, isPrefix, err = r.ReadLine()
> ln = append(ln, line...)
> }
>
> Now the last problem is how to search a substring in 100MB string
>
> On Saturday, May 7, 2022 at 10:25:29 PM UTC-7 Amnon wrote:
>
>> So you raise a couple of questions:
>>
>> 1) How about handling runes?
>> The nice thing about utf8 is you don't have to care. If you are searching
>> for the word ascii byte 'test', you can
>> simply compare byte by byte - the letter t is represented by 0x74, and
>> this byte in the search buffer can
>> only represent t - it can never be part of a multi-byte-rune - not in
>> utf8.
>>
>> 2) Can we handle lines of hundreds of MB?
>> Not really. The default will limit MaxTokenSize to 64K.
>> If you set your own buffer (recommended) then the size of this buffer
>> will be your maximum line length.
>> Unfortunately you can't really solve this problem without slurping the
>> entire file into memory. Because
>> the whole file could consist of a single line, with the word "test" at
>> the very end - and at this point, you will
>> still need access to the beginning of the file in order to output it.
>>
>> 3) The Mike Haertel message (from 2010) is interesting, as is the
>> reference to the paper
>> "Fast String Searching", by Andrew Hume and Daniel Sunday, in the
>> November 1991 issue of Software Practice & Experience
>> All this is a quite old. Computer architecture has changed significantly
>> since then, CPUs have got a lot faster and added vector
>> instructions which speed up dumb algorithms a lot. The relative cost of
>> accessing memory has massively increased, as memory
>> speeds have not increased at anything like the rate of CPU speeds (though
>> the Apple M1 system on chips family has changed this)
>> so memory -> CPU is often the bottleneck, and optimising CPU performance
>> through clever algorithms often merely results in
>> the CPU being stuck in stall cycles waiting for cache misses to be
>> serviced.
>> That said, some of the points Mike makes are still valid. One is the
>> massive cost of breaking up lines - which slows down performance
>> by an order of magnitude, by forcing you to examine every byte. So
>> searching for the search-text, and then when you find it, searching
>> backwards
>> from that point to find the enclosing line, is a big win.
>> The other point he makes which is still valid is to consider the cost of
>> copying data.
>> This will probably be the dominant cost. Running `cat > /dev/null` will
>> tell you how long it takes to merely read the data, and this will give
>> you a baseline figure. A reasonably carefully written grep clone should
>> achieve roughly the same runtime.
>>
>> As you specified the problem as reading from stdin, using memory mapped
>> files, using faster filesystems and fast SSDs will not help you.
>>
>> And, as always in Go, allocations trump almost everything else. If you
>> end up converting each line into a string, forcing allocations and GC work,
>> that will knock orders of magnitude off your performance.
>>
>> Disclaimer: I have not actually written any code or run any benchmarks
>> here. But there are some interesting questions which could make an
>> interesting paper. To what extent has Hume and Sunday's 1991 paper been
>> invalidated by hardware changes? Are smart algorithm's which
>> avoid looking at every byte still a win, or do they actually slow down
>> performance by adding branch mispredictions, and preventing vectorisation
>> optimisations? To what extent does the Apple/Arm system on chip family,
>> with their much lower memory latency, change these considerations?
>> How much slower will a simple Go clone of grep be than the C version
>> which has had decades of optimisation?
>> And how much effort would it take to optimise the Go clone to reach the
>> performance of the C version?
>> If anyone wants to write the code, do the benchmarks, and publish the
>> conclusions, then send me a link. It will be an interesting read.
>>
>>
>> On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote:
>>
>>> I cannot use fgrep in this environment.
>>>
>>> Looks like Scan buffer is resizable.
>>>
>>> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
>>> // Is the buffer full? If so, resize.
>>> if s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .end
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>>>
>>> == len(s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>>
>>> {
>>> // Guarantee no overflow in the multiplication below.
>>> const maxInt
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=maxInt&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23const%2520%255Bscan.go%2523192%255D.maxInt>
>>>
>>> = int
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=int&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23int%2523builtin>
>>> (^uint
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=uint&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23uint%2523builtin>(0)
>>>
>>> >> 1)
>>> if len(s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>>
>>> >= s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .maxTokenSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>>
>>> || len(s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>>
>>> > maxInt
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=192>/2
>>>
>>> {
>>> s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .setErr
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=252>
>>> (ErrTooLong
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=69>)
>>>
>>>
>>> return false
>>> }
>>> newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=198?gsn=newSize&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523197%255D.newSize>
>>>
>>> := len(s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>>
>>> * 2
>>> if newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>>
>>> == 0 {
>>> newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>>
>>> = startBufSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=82>
>>>
>>> }
>>> if newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>>
>>> > s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .maxTokenSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>>
>>> {
>>> newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>>
>>> = s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .maxTokenSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>>
>>> }
>>> newBuf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=newBuf&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523204%255D.newBuf>
>>>
>>> := make([]byte
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=byte&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23byte%2523builtin>,
>>>
>>> newSize
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>)
>>>
>>>
>>> copy(newBuf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>,
>>>
>>> s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
>>> [s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .start
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>>> :s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .end
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>])
>>>
>>>
>>> s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .buf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
>>>
>>> = newBuf
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>
>>>
>>> s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .end
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>>>
>>> -= s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .start
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>>>
>>> s
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>>> .start
>>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>>>
>>> = 0
>>> }
>>> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote:
>>>
>>>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote:
>>>> > The question is will scanner.Scan handle a line of 100s MB?
>>>>
>>>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer).
>>>> But that that point you want to start questioning why you're doing what
>>>> you're doing.
>>>>
>>>> Your invocation of grep can be a lot simpler
>>>>
>>>> ```
>>>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error {
>>>> cmd := exec.Command("fgrep", match)
>>>> cmd.Stdin = r
>>>> cmd.Stdout = stdout
>>>> cmd.Stderr = stderr
>>>> return cmd.Run()
>>>> }
>>>> ```
>>>>
>>>> You can then do whatever you want with the buffered output and stderr.
>>>> You can also make this use a pipe with minor changes if you expect the
>>>> output to be long and that stream processing of that would be sensible
>>>> (use cmd.Start instead of Run and look into StdoutPipe).
>>>>
>>>>
>>>>
--
You received this message because you are subscribed to the Google Groups
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/golang-nuts/56220b9b-841b-40c7-b7fd-22e02f189b19n%40googlegroups.com.