[go-nuts] Re: Large 2D slice performance question

Mandolyte Fri, 09 Dec 2016 07:54:53 -0800

This worked out well. I was able to "materialize" over 1200 trees in under 
an hour. Pretty amazing. I didn't end up using the DenseSet code. Your 
quick and dirty version only handled cycles starting from start value. But 
other than that the concept worked out very well.


I did try a version with multiple go routines, but found that they hurt 
performance. It was much faster without them. I think the learning here is 
that if the code is solely working in-memory, then go routines/channels 
aren't going to help. In fact, a few times I had strange messages come out 
on the console, which I think were due to starved channels. 

The full test resulted in over 700M paths being generated... impressive!

Thanks very much for the advice!

On Thursday, December 1, 2016 at 6:45:02 AM UTC-5, Mandolyte wrote:
>
> Wow, this will take some time to digest. Regrettably I have required 
> training today and won't be able to even play with this until tomorrow.
>
> In my (parent,child), there are 304K unique parent values and 1.05M unique 
> child values. Of course many child values are also parent values. Thus, 
> total names are only just over 1.14M.
>
> This data came from and will go back to a database and it must be correct 
> CSV. And I know there is data that needs to be escaped per RFC 4180. Also, 
> I had always thought that the CSV package was buffered since it has a 
> Flush() method (and if not used, you will lose data!). At any rate, I'm 
> reluctant to roll my own here since I've been bitten too many times by the 
> data content.
>
> Thanks for taking the time to review this!
>
> On Thursday, December 1, 2016 at 4:21:42 AM UTC-5, Egon wrote:
>>
>> See whether this works better: 
>> https://gist.github.com/egonelbre/d94ea561c3e63db009718e227e506b5b 
>> <https://www.google.com/url?q=https%3A%2F%2Fgist.github.com%2Fegonelbre%2Fd94ea561c3e63db009718e227e506b5b&sa=D&sntz=1&usg=AFQjCNFNg8SS63IgRFvzKRY9YYifLjdsMg>
>>
>> *There is a lot of room for improvements, (e.g. for your actual dataset 
>> increase defaultNameCount).*
>>
>> *PS: Avoid csv package for writing files, use bufio and Write directly... 
>> csv does some extra stuff, that you probably won't need.*
>>
>> btw. how many different names do you have?
>>
>> + Egon
>>
>> On Thursday, 1 December 2016 02:51:49 UTC+2, Mandolyte wrote:
>>>
>>> Thanks for the discussion! Package with tester is at:
>>> https://github.com/mandolyte/TableRecursion
>>>
>>> While I can't share the data, I could take a sample set of paths for a 
>>> root node, reverse engineer the pairs, and obfuscate... I've done this sort 
>>> of thing before but it is a bit of work. So I'll treat that as a last 
>>> resort. :-)
>>>
>>> On Wednesday, November 30, 2016 at 5:36:41 PM UTC-5, Mandolyte wrote:
>>>>
>>>> The finite set idea might work, but the set is well over 300K. The 
>>>> strings (part numbers) are not regular.
>>>>
>>>> I could make a single pass over the "parent" column and record in a 
>>>> map[string]int the index of the first occurrence. Then I would avoid 
>>>> sort.Search() having to find it each time. Or use sort.Search() on the 
>>>> smaller 300K slice if map performance was a problem...
>>>>
>>>> There is a lot of repetition in the "branches" - perhaps some caching 
>>>> would be appropriate...
>>>>
>>>> thanks for the ideas!
>>>>
>>>> On Wednesday, November 30, 2016 at 9:31:56 AM UTC-5, adon...@google.com 
>>>> wrote:
>>>>>
>>>>> On Wednesday, 30 November 2016 03:37:55 UTC+2, Mandolyte wrote:
>>>>>>>
>>>>>>> I have a fairly large 2D slice of strings, about 115M rows. These 
>>>>>>> are (parent, child) pairs that, processed recursively, form a tree. I 
>>>>>>> am 
>>>>>>> "materializing" all possible trees as well as determining for each root 
>>>>>>> node all child nodes for all levels for it.
>>>>>>>
>>>>>>  
>>>>> In addition to what Egon said:
>>>>>
>>>>> If the elements of the outer [][]string slice are all []string slices 
>>>>> of length 2, it would be much more efficient to use [2]string instead, 
>>>>> that 
>>>>> is, fixed-sized arrays of two strings, since this would avoid a lot of 
>>>>> memory allocation and pointer indirection.  The type of the outer slice 
>>>>> would become [][2]string.
>>>>>
>>>>> Also, if all your strings are drawn from a finite set, you'll find 
>>>>> many operations (such as comparison or set membership tests) much more 
>>>>> efficient if you represent each string by its index in some side table. 
>>>>>  Then instead of [2]string you can use [2]uint32, or perhaps even a 
>>>>> narrower integer type, which will make your main array more compact by at 
>>>>> least a factor of 4.
>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[go-nuts] Re: Large 2D slice performance question

Reply via email to