[go-nuts] files, readers, byte arrays (slices?), byte buffers and http.requests

2016-07-01 Thread Sri G
I'm working on receiving uploads through a form.

The tricky part is validation.

I attempt to read the first 1024 bytes to check the mime type of the file and 
then, if it's valid, read the rest, hash it, and also save it to disk. Reading 
the mime type is successful, and I've gotten it to work by chaining 
TeeReader, but it seems very hackish. What's the idiomatic way to do this?

I'm trying something like this: 


// Parse my multi part form 
...
// Get file handle
file, err := fh.Open()

var a bytes.Buffer

io.CopyN(&a, file, 1024)

mime := mimemagic.Match("", a.Bytes())
// Check mime type (this works fine)

// I'm trying to seek a stream, so this should be a no-op
file.Seek(0, 0)

// The file stored on disk is 1KB larger than the original, so it appears
// to be re-copying the entire file and appending it to the bytes.Buffer
io.Copy(&a, file)

checksum := md5.New()
b := io.TeeReader(&a, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))
fmt.Println("md5=", md5hex)

//Open file f for writing to disk
...
//Save file
io.Copy(f, b)


I checked the md5 of (first 1KB of original + original) and of (original minus 
the first 1KB); neither matches the md5 of the file being hashed.

Why can't I append the rest of the stream to the byte buffer to get the 
complete file in memory, and why is the byte buffer being "consumed"? 

I simply need to read the same array of bytes multiple times; I don't need 
to "copy" them. I'm coming from a C background, so I'm wondering what is 
going on behind the scenes as well.



[go-nuts] Re: files, readers, byte arrays (slices?), byte buffers and http.requests

2016-07-02 Thread Sri G
Thanks for the pointer. I also found "Asynchronously Split an io.Reader in Go 
(golang)" « Rodaine 
<http://rodaine.com/2015/04/async-split-io-reader-in-golang/> helpful, but I'm 
still missing something.

Version 1: the uploaded file has 1024 extra bytes at the end (too big):

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

fileReader := io.MultiReader(bytes.NewReader(mimebuf), file)

checksum := md5.New()

b := io.TeeReader(fileReader, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))

// Save file
io.Copy(f, b)

Version 2: the uploaded file is truncated by 1024 bytes (too small) (this 
makes sense, since the first 1024 bytes of the file were consumed):

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

checksum := md5.New()

// Adding file.Seek(0,0) here does not fix this issue

b := io.TeeReader(file, checksum)

md5hex := hex.EncodeToString(checksum.Sum(nil))

// Save file
io.Copy(f, b)


What am I doing wrong here? How do I get the Goldilocks version that's just 
right?

On Saturday, July 2, 2016 at 3:18:51 AM UTC-4, Tamás Gulácsi wrote:
>
> If you know you'll have to read the whole file into memory, then do that, 
> and use bytes.NewReader to create a reader for that byte slice.
>
> If you only read part of it, to decide whether to go on, then use fh.Read or 
> io.ReadAtLeast with a byte slice.
>
> If you read something, then want to read the whole thing from the beginning, 
> construct a Reader with io.MultiReader(bytes.NewReader(b), fh).
>
> You can combine these approaches, but if the whole file size is less than 
> a few KiB, I think it is easier, simpler and more performant (!) to read 
> the whole file into memory, into a bytes.Buffer, and construct the needed 
> readers with bytes.NewReader(buf.Bytes()). 
>
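
A minimal sketch of that MultiReader suggestion, using the same file handle 
and mimemagic.Match call as in my snippets above (hedged: error handling is 
elided, and io.ReadFull is my substitution for the plain file.Read):

// Read just enough for mime sniffing, then stitch the prefix back in front
// of the rest of the stream so later readers see the upload from byte 0.
mimebuf := make([]byte, 1024)
n, err := io.ReadFull(file, mimebuf)
if err != nil && err != io.ErrUnexpectedEOF {
    // handle read error
}
mime := mimemagic.Match("", mimebuf[:n])

// r replays the sniffed prefix, then continues with the unread remainder.
r := io.MultiReader(bytes.NewReader(mimebuf[:n]), file)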



[go-nuts] Re: files, readers, byte arrays (slices?), byte buffers and http.requests

2016-07-02 Thread Sri G
Update:

Adding file.Seek(0,0) does fix the issue in Version 2. The uploaded file is 
the correct size on disk, with the correct md5. Without it, the saved file is 
missing the first 1024 bytes. That makes sense.

There is something wrong with the way the md5 is calculated; it keeps 
giving the same hash. Any ideas?

This version, while most likely not idiomatic, works:

mimebuf := make([]byte, 1024)
_, err = file.Read(mimebuf)

mime := mimemagic.Match("", mimebuf)

file.Seek(0, 0)

checksum := md5.New()

io.Copy(checksum, file)

md5hex := hex.EncodeToString(checksum.Sum(nil))
fmt.Println("md5=", md5hex)

file.Seek(0, 0)
io.Copy(f, file)

It would be much appreciated if someone who understands the idiomatic way to 
do this could explain it.
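
For reference, here is a hedged sketch of what I think the single-pass version 
looks like, folding in the fix Dave Cheney points out later in the thread (Sum 
is only meaningful after all the data has been copied through the hash):

// Sniff the mime type from a 1024-byte prefix, then hash and save the whole
// upload in a single pass: MultiReader replays the prefix, and TeeReader
// feeds every byte it forwards into the md5 hash as a side effect.
mimebuf := make([]byte, 1024)
n, _ := io.ReadFull(file, mimebuf)
mime := mimemagic.Match("", mimebuf[:n])
// ... reject the upload here if the mime type is not allowed ...

checksum := md5.New()
body := io.TeeReader(io.MultiReader(bytes.NewReader(mimebuf[:n]), file), checksum)

// f is the destination file opened for writing, as in the earlier snippets.
if _, err := io.Copy(f, body); err != nil {
    // handle write error
}

// Only now has every byte passed through the hash.
md5hex := hex.EncodeToString(checksum.Sum(nil))
fmt.Println("md5=", md5hex)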



[go-nuts] Re: files, readers, byte arrays (slices?), byte buffers and http.requests

2016-08-03 Thread Sri G
Doh. Thanks. I did the setup but didn't click "execute".

Revisiting this because it's now a bottleneck: it directly impacts user 
experience (how long a request takes to process) and scalability (requests 
per second a single instance can handle). It wasn't premature optimization, 
but rather proper architecture planning :)

In C, the request would come into a ring buffer of Structs of Arrays (see SoA 
vs. AoS on Intel x86) and a pointer to the post data would be kept. That 
pointer is used to check the mime type as well as to compute the md5. Then the 
data is passed on to be written to disk before it is released. No copies are 
needed.

How can I accomplish this in idiomatic Go? When I say idiomatic, I mean 
efficient in space, time, and verbosity, depending on the requirements, and 
most importantly, not fighting the language. 

I'm having difficulty grokking whether a given call copies data or uses a 
reference to the underlying buffer (a pointer). Or does everything copy 
because data needs to be on the stack of each goroutine? 

I've read the source code of io.Copy: if the source implements io.WriterTo or 
the destination implements io.ReaderFrom, the copy is delegated to them, 
avoiding the intermediate buffer allocation and extra copy. However, 
crypto/md5 implements neither, so it's not possible to compute the md5 without 
copying data. Is this because the md5 library is written for streaming data 
rather than static data?

Is there a way to accomplish this? i.e. here's a buffer of data, compute 
the md5 on it.

Re: the mime type, I should be able to create a 1024-byte slice of the file 
and pass it to mimemagic. This should avoid the copy.
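
For what it's worth, when the data is already in memory as a []byte, md5.Sum 
hashes the slice directly (it's the same call used in the duplicate-checker 
thread below), and a sub-slice is only a view onto the same backing array, so 
neither step copies the payload. A small sketch, assuming data holds the 
upload:

// data is the request body already held in memory.
sum := md5.Sum(data) // returns a [16]byte computed over the slice in place
md5hex := hex.EncodeToString(sum[:])

// A slice header is just pointer + length + capacity; re-slicing copies no bytes.
head := data
if len(head) > 1024 {
    head = head[:1024]
}
mime := mimemagic.Match("", head)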


On Saturday, July 2, 2016 at 9:27:15 PM UTC-4, Dave Cheney wrote:
>
> The hash is always the same because you ask for the hash value before 
> writing any data through it with io.Copy.




[go-nuts] Duplicate File Checker Performance

2016-10-15 Thread Sri G
I wrote a multi-threaded duplicate file checker using md5; here is the complete 
source: https://github.com/hbfs/dupe_check/blob/master/dupe_check.go

I benchmarked two variants on the same machine, on the same set of files (a 
~1.7GB folder with ~600 files, each ~3MB on average), multiple times, purging 
the disk cache in between each run.

With this code:

hash := md5.New()

if _, err := io.Copy(hash, file); err != nil {
  fmt.Println(err)
}

var md5sum [md5.Size]byte
copy(md5sum[:], hash.Sum(nil)[:16])

// 3.35s user 105.20s system 213% cpu 50.848 total, memory usage is ~30MB


With this code:

data, err := ioutil.ReadFile(path)
if err != nil {
    fmt.Println(err)
}

md5sum := md5.Sum(data)

// 3.10s user 31.75s system 104% cpu 33.210 total, memory usage is ~1.52GB

The memory usage makes sense, but why is the streaming version ~3x slower than 
the read-the-entire-file-into-memory version? This trade-off doesn't make 
sense to me, since the file is being read from disk in both situations, which 
should be the limiting factor; only then is the md5sum computed.

In the streaming version there is an extra copy from []byte to [16]byte, but 
that should be negligible.

The only theory I can think of is context switching.

Streaming version:
disk -> processor
The processor is waiting on the disk read, so it switches to reading another 
file, sleeping the thread.

Entire file:
disk -> memory -> processor
The file is in memory, so there is not as much context switching.

What do you think? Thanks!



[go-nuts] Re: Duplicate File Checker Performance

2016-10-15 Thread Sri G
To diagnose this issue, I tried some benchmarks with time-tested tools:

On the same directory:

find DIR -type f -exec md5 {} \; 

5.36s user 2.93s system 50% cpu 16.552 total

Adding a hashmap on top of that wouldn't significantly increase the time.

Making this multi-processed (32 processes): 

find DIR -type f -print0 | xargs -0 -n 1 -P 32 md5

5.32s user 3.24s system 43% cpu 19.503 total

With 64 processes, like GOMAXPROCS=64 on this machine:

find DIR -type f -print0 | xargs -0 -n 1 -P 64 md5


5.31s user 3.66s system 42% cpu 20.999 total

So it seems disk access is the bottleneck, as it should be, and the biggest 
performance hit comes from the synchronization.

I wrote a Python script to do the same; the code is here: 
https://github.com/hbfs/dupe_check/blob/master/dupe_check.py

2.97s user 0.92s system 24% cpu 15.590 total, memory usage is ~8MB

My next step is to try a single-threaded/single-goroutine version in Go to 
replicate this level of performance and get a deeper understanding of how Go 
is built and how to use it more effectively. Advice appreciated!
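
In the meantime, here is a rough sketch of the kind of single-goroutine 
baseline I have in mind (my own assumption of the shape, not code from the 
repo): walk the tree and hash each file sequentially, reusing one buffer so 
allocations stay flat.

package main

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func main() {
    buf := make([]byte, 65536) // one reusable copy buffer for every file
    filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return nil
        }
        f, err := os.Open(path)
        if err != nil {
            return nil // skip unreadable files
        }
        defer f.Close()
        h := md5.New()
        if _, err := io.CopyBuffer(h, f, buf); err != nil {
            return nil
        }
        fmt.Printf("%x  %s\n", h.Sum(nil), path)
        return nil
    })
}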



[go-nuts] Re: Duplicate File Checker Performance

2016-10-15 Thread Sri G
Thanks. I made the Go code similar to the Python version by using 
io.CopyBuffer with a block size of 65536. 

buf := make([]byte, 65536)

if _, err := io.CopyBuffer(hash, file, buf); err != nil {
fmt.Println(err)
}

It didn't make much of a difference; it was slightly faster.

What got it to the same place was running ComputeHash in the same goroutine 
as the Walk function, instead of in its own goroutine for each file:

+ComputeHash(path, info, queue, wg)
-go ComputeHash(path, info, queue, wg)


2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~7MB

Here's the before and after pprof call graphs (images not preserved in this 
archive):

BEFORE, with 'go ComputeHash(...)': [pprof CPU profile graph]

AFTER, with 'ComputeHash(...)': [pprof CPU profile graph]

Since disk reads are so much slower, computing the hash for each file in 
its own goroutine caused a huge slowdown.

btw this is on a RAID10, with SSD: 

Old code, SSD: 3.31s user 17.87s system 244% cpu 8.667 total

New code, SSD: 2.88s user 0.84s system 69% cpu 5.369 total

It shows you can throw hardware at a problem, BUT the old code locks up my 
system momentarily.
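
A middle ground between those two extremes, in line with the worker counts 
that come up later in this thread, would be a small fixed pool of hashing 
goroutines fed by the walk. A hedged sketch (not from the repo; root and the 
result handling are placeholders, and it assumes the usual imports: 
crypto/md5, io, os, path/filepath, sync):

// Bounded workers: the walk only discovers paths; a fixed number of
// goroutines do the opens/reads/hashes, so concurrent disk access stays
// capped instead of spawning one goroutine per file.
paths := make(chan string, 128)
var wg sync.WaitGroup
for i := 0; i < 8; i++ { // roughly the 8-9 workers found optimal below
    wg.Add(1)
    go func() {
        defer wg.Done()
        buf := make([]byte, 65536) // per-worker reusable buffer
        for p := range paths {
            f, err := os.Open(p)
            if err != nil {
                continue
            }
            h := md5.New()
            io.CopyBuffer(h, f, buf)
            f.Close()
            // record h.Sum(nil) for p here
        }
    }()
}
filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
    if err == nil && !info.IsDir() {
        paths <- p
    }
    return nil
})
close(paths)
wg.Wait()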


On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:
>
> Sorry, I meant that calling Write on the hash type might be slower if it's 
> called more often.
>
> (I'm on mobile right now. When I get back to a keyboard I'll try to come 
> up with an example)
>



Re: [go-nuts] Re: Duplicate File Checker Performance

2016-10-16 Thread Sri G
This isn't exactly the same because I deleted some files, but it shouldn't 
really matter.

Switched to md5:

--- a/dup.go
+++ b/dup.go
@@ -3,7 +3,7 @@ package main
-   "crypto/sha256"
+   "crypto/md5"

@@ -207,8 +207,8 @@ func main() {
+type Hash [16]byte // appropriate for MD5
+// type Hash [32]byte // appropriate for SHA-256

 func hashFile(p string, hash []byte, prefix int64) (count int64) {
@@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
+   hasher := md5.New() // select MD5 in concert with "Hash" above
+   // hasher := sha256.New() // select SHA-256 in concert with "Hash" above


Checking only same-sized files is a huge speed-up (82x fewer bytes checked):

2016/10/16 14:33:51      total:  566 files (100.00%), 1667774744 bytes (100.00%)
2016/10/16 14:33:51   examined:    9 files (  1.59%),   20271440 bytes (  1.22%) in 0.4209 seconds
2016/10/16 14:33:51 duplicates:    9 files (  1.59%),   20271440 bytes (  1.22%)

Checking the first 4KB of files and only hashing the full file when those 
match is another cool optimization (on average 768x fewer bytes checked in my 
case). Really nice, Michael.

With workers = 8:

RAID10: 0.05s user 0.04s system 37% cpu 0.231 total (couldn't check memory 
usage, but it's probably negligible)
SSD:    0.05s user 0.04s system 59% cpu 0.137 total

Since SSD's and my filesystem are optimized for 4K random reads, it makes 
sense to use multiple threads/goroutines.

Optimal # of workers = 9 on RAID10: 0.05s user 0.04s system 40% cpu 0.220 total
On SSD, workers = 8~9:              0.04s user 0.04s system 68% cpu 0.117 total

Not so much when you're doing a full sequential read. Because I use the md5 
for other purposes, the entire file must be hashed, so sadly I can't use 
these optimizations.
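
For anyone following along, a rough sketch of the filtering idea described 
above (not Michael's actual implementation; files, the grouping maps, and the 
inline prefix hasher are placeholders for illustration, with crypto/md5, io, 
and os assumed as imports):

// Three-stage filter: group by size, then by a hash of the first 4KB, and
// only full-hash files that still collide.
hashPrefix := func(path string, n int) [md5.Size]byte {
    f, err := os.Open(path)
    if err != nil {
        return [md5.Size]byte{}
    }
    defer f.Close()
    buf := make([]byte, n)
    m, _ := io.ReadFull(f, buf)
    return md5.Sum(buf[:m])
}

bySize := map[int64][]string{} // files is a []string collected by the walk
for _, p := range files {
    if info, err := os.Stat(p); err == nil && !info.IsDir() {
        bySize[info.Size()] = append(bySize[info.Size()], p)
    }
}
for _, group := range bySize {
    if len(group) < 2 {
        continue // a unique size cannot have a duplicate
    }
    byPrefix := map[[md5.Size]byte][]string{}
    for _, p := range group {
        k := hashPrefix(p, 4096)
        byPrefix[k] = append(byPrefix[k], p)
    }
    for _, candidates := range byPrefix {
        if len(candidates) >= 2 {
            // only these still need a full-file hash
        }
    }
}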

On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:
> Sri G,
>
> How does this time compare to my “Dup” program? I can’t test for you…since 
> it is your filesystem…but I thought I had it going about as fast as 
> possible a few years ago when I wrote that one.
>
> https://github.com/MichaelTJones/dup
>
> Michael


Re: [go-nuts] Re: Duplicate File Checker Performance

2016-10-21 Thread Sri G
Yea :/

I appreciate you sharing your project and your code! I learned a lot of 
useful Go patterns (referencing a fixed-size byte buffer as a slice) and how 
to re-use byte buffers, like in the Python version, to keep memory usage 
down.
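
For the archive, the two patterns I mean, as tiny hedged snippets lifted from 
the shapes used earlier in this thread (hasher and file are assumed from the 
surrounding code):

// 1) A fixed-size array viewed as a slice: md5sum[:] is a window onto the
//    array's backing storage, so no extra allocation happens.
var md5sum [md5.Size]byte
copy(md5sum[:], hasher.Sum(nil))

// 2) One reusable copy buffer handed to io.CopyBuffer for every file,
//    instead of letting io.Copy allocate a fresh 32KB buffer each time.
buf := make([]byte, 65536)
if _, err := io.CopyBuffer(hasher, file, buf); err != nil {
    fmt.Println(err)
}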

On Sunday, October 16, 2016 at 4:32:34 PM UTC-4, Michael Jones wrote:
>
> Oh, I see. Well if you must read and hash every byte of every file then 
> you really are mostly measuring device speed.
>

[go-nuts] Re: Golang should have a center packages index hosting like npm, rust crates

2016-10-21 Thread Sri G
Pros: 
+ Getting stats on popular packages and code. Anonymity should be preserved 
the way apt does it on Debian/Ubuntu, by requesting permission for anonymous 
stats reporting.
+ Showing Go's popularity and its increasing community and adoption

Cons: 
- A central point of failure for pulling repos. Look at the massive DDoS on 
Dyn right now taking out quite a few major web platforms (Twitter, Spotify, 
Reddit, The New York Times, WIRED.com, etc.)
- Central repos are often terribly slow, and creating local mirrors is not easy
- Not suitable for automated builds/testing without reliable mirrors

Adding anonymous reporting to go get and a website to show results would 
solve this nicely without creating a bottleneck.

On Thursday, October 20, 2016 at 12:05:03 AM UTC-4, zixu mo wrote:
>
> Golang should have a central package index hosting site, like npm or Rust's crates.
>
>
>
> For Rust : 
> https://crates.io/
>
> For JS:
> https://www.npmjs.com/
>
> For PHP
> https://packagist.org/
>
> For Ruby
> https://rubygems.org/
>
>
>
