So you are working with non-overlapping infixes? _9[\'abcdefghijklmnopqrstuvwxyz' abcdefghi jklmnopqr stuvwxyz
Thanks, -- Raul On Tue, Feb 3, 2015 at 5:47 PM, Joe Bogner <[email protected]> wrote: > Raul, Bob - I tried the blocked approach earlier on and realized the > infixes across boundaries would be a bit of a challenge. It was still > slower than I needed so I didn't fix up the boundary logic. It takes > 84 seconds with a 1e6 block size and 43 seconds with a 10e6 block size > and around 43 seconds still for 20e6 block size, still under the > memory limit too which is good. A 50e6 block size puts me over. It's > still faster than the memr method but I think that's mainly due to the > loop. > > Devon - thanks for the link. I've been watching it on the Recent > Changes. I may try to apply it later > > Code I was testing: > > calc3 =: 4 : 0 > blocks=: x[\y > uniq=:'' > lim=.(# blocks) > for_i. (i. lim) do. > smoutput i > extra=.'' > NB. get current block and subset from next > if. lim>(>:i) do. > extra=.(20{. (i+1)&{::blocks) > end. > blocktext=: (i{::blocks),extra > uniq=: ~. (uniq, ( 9[\ blocktext)) > end. > # uniq > ) > > You can see how it produces the incorrect results -- I would need to > fix up the boundary logic.It's close on the large input. The correct > answer is 2,482,912 and this calculates 2482921. On a small input it's > clearly wrong > > _9 calc3 'abcdefghijklmnopqrstuvwxyz' > > Produces 35 > > Whereas the correct answer is: 3 > > # ~. _9[\'abcdefghijklmnopqrstuvwxyz' > 3 > > Oh well > > On Tue, Feb 3, 2015 at 4:44 PM, Devon McCormick <[email protected]> wrote: >> I've been updating work on my adverb for working with large files - >> http://www.jsoftware.com/jwiki/DevonMcCormick/Code/WorkOnLargeFiles - that >> may provide a general way to apply an arbitrary verb across a large file. >> >> On Tue, Feb 3, 2015 at 4:25 PM, Raul Miller <[email protected]> wrote: >> >>> Another approach, depending on what the processing looks like, might >>> involve selecting indices based on something smaller than nine >>> characters. For example, if certain character pairs were "valid" or >>> maybe even "better than others" you could get indices which select >>> those pairs and further refine from there. >>> >>> But the blocked approach can also be a very good one. (Just make sure >>> to handle infixes which span block boundaries.) >>> >>> Thanks, >>> >>> -- >>> Raul >>> >>> On Tue, Feb 3, 2015 at 3:54 PM, Robert Bernecky >>> <[email protected]> wrote: >>> > I agree that you need to have more than one cell on hand to do the nub! >>> > >>> > However, try this approach. It's crude, but it might match or beat C: >>> > >>> > - decide how much input you can process at once. >>> > - Result =. '' >>> > - while more hunks, loop over: >>> > - hunkresult = nub-of-infix on current hunk of the input. >>> > - Result = nub Result,hunkresult >>> > >>> > This lets J do most of the heavy lifting, while keeping you >>> > within the memory constraints. >>> > >>> > Bob >>> > >>> > On 15-02-03 03:45 PM, Joe Bogner wrote: >>> >> >>> >> Sorry for the double email, I realized I replied to you directly. I'm >>> >> replying to you directly since you replied to me directly, just in >>> >> case you didn't want to send it to chat... >>> >> >>> >> In any case, the whole goal is to get the nub of the 9 character >>> >> infixes. There is special code for some of the mathematical operations >>> >> on infixes (such as sum) that don't generate all the infixes, however >>> >> I having found anything for nub (distinct), which makes some sense >>> >> since it can't be reduced to an atom like the math operations >>> >> >>> >> Thanks for the ideas >>> >> >>> >> >>> >> On Tue, Feb 3, 2015 at 3:42 PM, Robert Bernecky >>> >> <[email protected]> wrote: >>> >>> >>> >>> "9 character infixes" - that looks like an input to a convolution, does >>> >>> it >>> >>> not? >>> >>> If so, then there should be some J dohickey (adverb or conjunction) >>> >>> (My J is very very rusty, and likely obsolete...) >>> >>> to do that that will invoke your kernel verb without generating all the >>> >>> infixes. >>> >>> >>> >>> By convolution here, I mean something like moving-window inner >>> >>> product, string search, or, err, convolution. >>> >>> >>> >>> What processing do you apply to each infix? >>> >>> >>> >>> Bob >>> >>> >>> >>> >>> >>> >>> >>> On 15-02-03 03:36 PM, Joe Bogner wrote: >>> >>>> >>> >>>> Thanks Bob. I agree that compiling often helps. I have previously >>> >>>> kicked around the idea of generating byte code or machine code and >>> >>>> interpreting it or executing it. I have called dlls in other >>> >>>> circumstances, but it would be nice to have everything in J. >>> >>>> >>> >>>> In this case, the challenge put a 2GB constraint on memory and 60 >>> >>>> seconds of execution time. It's an artificial constraint, which is why >>> >>>> I posted to chat. Those constraint wouldn't exist in the real world. >>> >>>> >>> >>>> I'm scanning a 240MB text file and generating 9 character infixes and >>> >>>> processing them. I hit the memory limit using 9]\ so I figured I'd try >>> >>>> looping and using memr (15!:1). In an earlier version, about 270 >>> >>>> seconds were spent in the looping construct and 40 seconds in memr. It >>> >>>> seemed like there should be a faster looping construct >>> >>>> >>> >>>> Here's a simple version: >>> >>>> >>> >>>> txt=: a. {~ (97&+(i. 26)) >>> >>>> addr=: mema (# txt) >>> >>>> txt memw addr,0,(#txt) >>> >>>> 6!:2 '([:<:([:0&{::(] ; smoutput&([:15!:1(addr,9,~<:@:])) ))) ^:] >>> >>>> (<:#txt)' >>> >>>> >>> >>>> >>> >>>> yz >>> >>>> xyz >>> >>>> wxyz >>> >>>> vwxyz >>> >>>> uvwxyz >>> >>>> tuvwxyz >>> >>>> stuvwxyz >>> >>>> rstuvwxyz >>> >>>> qrstuvwxy >>> >>>> pqrstuvwx >>> >>>> opqrstuvw >>> >>>> nopqrstuv >>> >>>> mnopqrstu >>> >>>> lmnopqrst >>> >>>> klmnopqrs >>> >>>> jklmnopqr >>> >>>> ijklmnopq >>> >>>> hijklmnop >>> >>>> ghijklmno >>> >>>> fghijklmn >>> >>>> efghijklm >>> >>>> defghijkl >>> >>>> cdefghijk >>> >>>> bcdefghij >>> >>>> abcdefghi >>> >>>> >>> >>>> >>> >>>> I would need more processing of the results, but you can see roughly >>> >>>> how it compares to 9]\ >>> >>>> >>> >>>> 9[\txt >>> >>>> abcdefghi >>> >>>> bcdefghij >>> >>>> cdefghijk >>> >>>> defghijkl >>> >>>> efghijklm >>> >>>> fghijklmn >>> >>>> ghijklmno >>> >>>> hijklmnop >>> >>>> ijklmnopq >>> >>>> jklmnopqr >>> >>>> klmnopqrs >>> >>>> lmnopqrst >>> >>>> mnopqrstu >>> >>>> nopqrstuv >>> >>>> opqrstuvw >>> >>>> pqrstuvwx >>> >>>> qrstuvwxy >>> >>>> rstuvwxyz >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> On Tue, Feb 3, 2015 at 3:05 PM, Robert Bernecky >>> >>>> <[email protected]> wrote: >>> >>>>> >>> >>>>> Compiling often helps. When I compile looping, scalar-dominated >>> >>>>> APL code, I generally get about a factor of 1000X speedup. >>> >>>>> You should get the same sort of speedup with compiled J. >>> >>>>> [No, I do not have a J compiler, but will be happy to create one >>> >>>>> if funding were to materialize.] >>> >>>>> >>> >>>>> However, you mentioned that you were "looping due to memory >>> >>>>> constraints...". Is it possible that a different algorithm would let >>> >>>>> you do less looping? E.g., looping over 1e4 elements at once, >>> >>>>> or using a different approach, such as dynamic programming? >>> >>>>> >>> >>>>> Bob >>> >>>>> >>> >>>>> >>> >>>>> On 15-02-03 02:53 PM, Joe Bogner wrote: >>> >>>>>> >>> >>>>>> I'm playing around with a coding challenge that I'm solving with a >>> >>>>>> loop that needs to execute over 200 million times. >>> >>>>>> >>> >>>>>> Setting aside whether that's a smart approach, what is the quickest >>> >>>>>> looping construct in J? I realize that J is not optimized for this >>> >>>>>> style of programming, but I'm still curious if it can be done >>> >>>>>> efficiently. >>> >>>>>> >>> >>>>>> The quickest I've found takes about 7 seconds to iterate 100 million >>> >>>>>> times, whereas the same loop in C takes 0.03 seconds. >>> >>>>>> >>> >>>>>> 6!:2 '(] [ ])^:] 100e6' >>> >>>>>> 7.00952 >>> >>>>>> >>> >>>>>> The train is taking up some time, but eventually I will need it >>> >>>>>> >>> >>>>>> No train - 2.7 seconds >>> >>>>>> >>> >>>>>> 6!:2 ']^:] 100e6' >>> >>>>>> 2.70656 >>> >>>>>> >>> >>>>>> >>> >>>>>> Explicit: >>> >>>>>> >>> >>>>>> loop=: 3 : 0 >>> >>>>>> ctr=:0 >>> >>>>>> while. 1 do. >>> >>>>>> ctr=:>:ctr >>> >>>>>> if. y < ctr do. >>> >>>>>> smoutput 'done' >>> >>>>>> return. >>> >>>>>> end. >>> >>>>>> end. >>> >>>>>> ) >>> >>>>>> >>> >>>>>> 6!:2 'loop 100e6' >>> >>>>>> 122.48 >>> >>>>>> >>> >>>>>> Clearly explicit isn't the way to go >>> >>>>>> >>> >>>>>> Back to the tacit: >>> >>>>>> >>> >>>>>> log=:smoutput bind ] >>> >>>>>> >>> >>>>>> ([: <: [: 0&{:: ] ; log )^:] 10 >>> >>>>>> 10 >>> >>>>>> 9 >>> >>>>>> 8 >>> >>>>>> 7 >>> >>>>>> 6 >>> >>>>>> 5 >>> >>>>>> 4 >>> >>>>>> 3 >>> >>>>>> 2 >>> >>>>>> 1 >>> >>>>>> 0 >>> >>>>>> >>> >>>>>> I would replace log with my function that uses the counter value >>> >>>>>> >>> >>>>>> That's pretty slow though on 100 million iterations - 88 seconds >>> >>>>>> >>> >>>>>> 6!:2 '([: <: [: 0&{:: ] ; ] )^:] 100e6' >>> >>>>>> 88.3644 >>> >>>>>> >>> >>>>>> Let's try a new log >>> >>>>>> >>> >>>>>> log=: ] <:@:[ (smoutput bind ]) >>> >>>>>> log 55 >>> >>>>>> 55 >>> >>>>>> 54 >>> >>>>>> >>> >>>>>> >>> >>>>>> That looks like it could work. Let's put a dummy in for now >>> >>>>>> >>> >>>>>> log=: ] <:@:[ ] >>> >>>>>> >>> >>>>>> 6!:2 'log^:] 100e6' >>> >>>>>> 40.9542 >>> >>>>>> >>> >>>>>> Still 40 seconds... Any ideas on how to speed up iteration like >>> this? >>> >>>>>> >>> >>>>>> >>> >>>>>> Sidenote: I'm looping due to memory constraints placed on the coding >>> >>>>>> challenge >>> >>>>>> >>> ---------------------------------------------------------------------- >>> >>>>>> For information about J forums see >>> http://www.jsoftware.com/forums.htm >>> >>>>>> >>> >>>>> -- >>> >>>>> Robert Bernecky >>> >>>>> Snake Island Research Inc >>> >>>>> 18 Fifth Street >>> >>>>> Ward's Island >>> >>>>> Toronto, Ontario M5J 2B9 >>> >>>>> >>> >>>>> [email protected] >>> >>>>> tel: +1 416 203 0854 >>> >>>>> >>> >>> >>> >>> -- >>> >>> Robert Bernecky >>> >>> Snake Island Research Inc >>> >>> 18 Fifth Street >>> >>> Ward's Island >>> >>> Toronto, Ontario M5J 2B9 >>> >>> >>> >>> [email protected] >>> >>> tel: +1 416 203 0854 >>> >>> >>> > >>> > >>> > -- >>> > Robert Bernecky >>> > Snake Island Research Inc >>> > 18 Fifth Street >>> > Ward's Island >>> > Toronto, Ontario M5J 2B9 >>> > >>> > [email protected] >>> > tel: +1 416 203 0854 >>> > >>> > ---------------------------------------------------------------------- >>> > For information about J forums see http://www.jsoftware.com/forums.htm >>> ---------------------------------------------------------------------- >>> For information about J forums see http://www.jsoftware.com/forums.htm >>> >> >> >> >> -- >> Devon McCormick, CFA >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
