So you are working with non-overlapping infixes?

   _9[\'abcdefghijklmnopqrstuvwxyz'
abcdefghi
jklmnopqr
stuvwxyz

Thanks,

-- 
Raul

On Tue, Feb 3, 2015 at 5:47 PM, Joe Bogner <[email protected]> wrote:
> Raul, Bob - I tried the blocked approach earlier on and realized the
> infixes across boundaries would be a bit of a challenge. It was still
> slower than I needed so I didn't fix up the boundary logic. It takes
> 84 seconds with a 1e6 block size and 43 seconds with a 10e6 block size
> and around 43 seconds still for 20e6 block size, still under the
> memory limit too which is good.  A 50e6 block size puts me over. It's
> still faster than the memr method but I think that's mainly due to the
> loop.
>
> Devon - thanks for the link. I've been watching it on the Recent
> Changes. I may try to apply it later
>
> Code I was testing:
>
> calc3 =: 4 : 0
> blocks=: x[\y
> uniq=:''
> lim=.(# blocks)
> for_i. (i. lim) do.
> smoutput i
> extra=.''
> NB. get current block and subset from next
> if. lim>(>:i) do.
> extra=.(20{. (i+1)&{::blocks)
> end.
> blocktext=: (i{::blocks),extra
> uniq=: ~. (uniq, ( 9[\ blocktext))
> end.
> # uniq
> )
>
> You can see how it produces the incorrect results -- I would need to
> fix up the boundary logic.It's close on the large input. The correct
> answer is 2,482,912 and this calculates 2482921. On a small input it's
> clearly wrong
>
> _9 calc3 'abcdefghijklmnopqrstuvwxyz'
>
> Produces 35
>
> Whereas the correct answer is: 3
>
> # ~. _9[\'abcdefghijklmnopqrstuvwxyz'
> 3
>
> Oh well
>
> On Tue, Feb 3, 2015 at 4:44 PM, Devon McCormick <[email protected]> wrote:
>> I've been updating work on my adverb for working with large files -
>> http://www.jsoftware.com/jwiki/DevonMcCormick/Code/WorkOnLargeFiles - that
>> may provide a general way to apply an arbitrary verb across a large file.
>>
>> On Tue, Feb 3, 2015 at 4:25 PM, Raul Miller <[email protected]> wrote:
>>
>>> Another approach, depending on what the processing looks like, might
>>> involve selecting indices based on something smaller than nine
>>> characters. For example, if certain character pairs were "valid" or
>>> maybe even "better than others" you could get indices which select
>>> those pairs and further refine from there.
>>>
>>> But the blocked approach can also be a very good one. (Just make sure
>>> to handle infixes which span block boundaries.)
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>> On Tue, Feb 3, 2015 at 3:54 PM, Robert Bernecky
>>> <[email protected]> wrote:
>>> > I agree that you need to have more than one cell on hand to do the nub!
>>> >
>>> > However,  try this approach. It's crude, but it might match or beat C:
>>> >
>>> >    - decide how much input you can process at once.
>>> >    - Result =. ''
>>> >    - while more hunks, loop over:
>>> >        -    hunkresult = nub-of-infix on current hunk of the input.
>>> >        -    Result = nub Result,hunkresult
>>> >
>>> > This lets J do most of the heavy lifting, while keeping you
>>> > within the memory constraints.
>>> >
>>> > Bob
>>> >
>>> > On 15-02-03 03:45 PM, Joe Bogner wrote:
>>> >>
>>> >> Sorry for the double email, I realized I replied to you directly. I'm
>>> >> replying to you directly since you replied to me directly, just in
>>> >> case you didn't want to send it to chat...
>>> >>
>>> >> In any case, the whole goal is to get the nub of the 9 character
>>> >> infixes. There is special code for some of the mathematical operations
>>> >> on infixes (such as sum) that don't generate all the infixes, however
>>> >> I having found anything for nub (distinct), which makes some sense
>>> >> since it can't be reduced to an atom like the math operations
>>> >>
>>> >> Thanks for the ideas
>>> >>
>>> >>
>>> >> On Tue, Feb 3, 2015 at 3:42 PM, Robert Bernecky
>>> >> <[email protected]> wrote:
>>> >>>
>>> >>> "9 character infixes" - that looks like an input to a convolution, does
>>> >>> it
>>> >>> not?
>>> >>> If so, then there should be some J dohickey (adverb or conjunction)
>>> >>>   (My J is very very rusty, and likely obsolete...)
>>> >>> to do that that will invoke your kernel verb without generating all the
>>> >>> infixes.
>>> >>>
>>> >>> By convolution here, I mean something like moving-window inner
>>> >>> product, string search, or, err, convolution.
>>> >>>
>>> >>> What processing do you apply to each infix?
>>> >>>
>>> >>> Bob
>>> >>>
>>> >>>
>>> >>>
>>> >>> On 15-02-03 03:36 PM, Joe Bogner wrote:
>>> >>>>
>>> >>>> Thanks Bob. I agree that compiling often helps. I have previously
>>> >>>> kicked around the idea of generating byte code or machine code and
>>> >>>> interpreting it or executing it. I have called dlls in other
>>> >>>> circumstances, but it would be nice to have everything in J.
>>> >>>>
>>> >>>> In this case, the challenge put a 2GB constraint on memory and 60
>>> >>>> seconds of execution time. It's an artificial constraint, which is why
>>> >>>> I posted to chat. Those constraint wouldn't exist in the real world.
>>> >>>>
>>> >>>> I'm scanning a 240MB text file and generating 9 character infixes and
>>> >>>> processing them. I hit the memory limit using 9]\ so I figured I'd try
>>> >>>> looping and using memr (15!:1). In an earlier version, about 270
>>> >>>> seconds were spent in the looping construct and 40 seconds in memr. It
>>> >>>> seemed like there should be a faster looping construct
>>> >>>>
>>> >>>> Here's a simple version:
>>> >>>>
>>> >>>> txt=: a. {~ (97&+(i. 26))
>>> >>>> addr=: mema (# txt)
>>> >>>> txt memw addr,0,(#txt)
>>> >>>> 6!:2 '([:<:([:0&{::(] ; smoutput&([:15!:1(addr,9,~<:@:]))    ))) ^:]
>>> >>>> (<:#txt)'
>>> >>>>
>>> >>>>
>>> >>>>      yz
>>> >>>> xyz
>>> >>>> wxyz
>>> >>>> vwxyz
>>> >>>> uvwxyz
>>> >>>> tuvwxyz
>>> >>>> stuvwxyz
>>> >>>> rstuvwxyz
>>> >>>> qrstuvwxy
>>> >>>> pqrstuvwx
>>> >>>> opqrstuvw
>>> >>>> nopqrstuv
>>> >>>> mnopqrstu
>>> >>>> lmnopqrst
>>> >>>> klmnopqrs
>>> >>>> jklmnopqr
>>> >>>> ijklmnopq
>>> >>>> hijklmnop
>>> >>>> ghijklmno
>>> >>>> fghijklmn
>>> >>>> efghijklm
>>> >>>> defghijkl
>>> >>>> cdefghijk
>>> >>>> bcdefghij
>>> >>>> abcdefghi
>>> >>>>
>>> >>>>
>>> >>>> I would need more processing of the results, but you can see roughly
>>> >>>> how it compares to 9]\
>>> >>>>
>>> >>>> 9[\txt
>>> >>>> abcdefghi
>>> >>>> bcdefghij
>>> >>>> cdefghijk
>>> >>>> defghijkl
>>> >>>> efghijklm
>>> >>>> fghijklmn
>>> >>>> ghijklmno
>>> >>>> hijklmnop
>>> >>>> ijklmnopq
>>> >>>> jklmnopqr
>>> >>>> klmnopqrs
>>> >>>> lmnopqrst
>>> >>>> mnopqrstu
>>> >>>> nopqrstuv
>>> >>>> opqrstuvw
>>> >>>> pqrstuvwx
>>> >>>> qrstuvwxy
>>> >>>> rstuvwxyz
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Feb 3, 2015 at 3:05 PM, Robert Bernecky
>>> >>>> <[email protected]> wrote:
>>> >>>>>
>>> >>>>> Compiling often helps. When I compile looping, scalar-dominated
>>> >>>>> APL code, I generally get about a factor of 1000X speedup.
>>> >>>>> You should get the same sort of speedup with compiled J.
>>> >>>>> [No, I do not have a J compiler, but will be happy to create one
>>> >>>>> if funding were to materialize.]
>>> >>>>>
>>> >>>>> However, you mentioned that you were "looping due to memory
>>> >>>>> constraints...". Is it possible that a different algorithm would let
>>> >>>>> you do less looping? E.g., looping over 1e4 elements at once,
>>> >>>>> or using a different approach, such as dynamic programming?
>>> >>>>>
>>> >>>>> Bob
>>> >>>>>
>>> >>>>>
>>> >>>>> On 15-02-03 02:53 PM, Joe Bogner wrote:
>>> >>>>>>
>>> >>>>>> I'm playing around with a coding challenge that I'm solving with a
>>> >>>>>> loop that needs to execute over 200 million times.
>>> >>>>>>
>>> >>>>>> Setting aside whether that's a smart approach, what is the quickest
>>> >>>>>> looping construct in J? I realize that J is not optimized for this
>>> >>>>>> style of programming, but I'm still curious if it can be done
>>> >>>>>> efficiently.
>>> >>>>>>
>>> >>>>>> The quickest I've found takes about 7 seconds to iterate 100 million
>>> >>>>>> times, whereas the same loop in C takes 0.03 seconds.
>>> >>>>>>
>>> >>>>>> 6!:2 '(] [ ])^:] 100e6'
>>> >>>>>> 7.00952
>>> >>>>>>
>>> >>>>>> The train is taking up some time, but eventually I will need it
>>> >>>>>>
>>> >>>>>> No train - 2.7 seconds
>>> >>>>>>
>>> >>>>>> 6!:2 ']^:] 100e6'
>>> >>>>>> 2.70656
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Explicit:
>>> >>>>>>
>>> >>>>>> loop=: 3 : 0
>>> >>>>>> ctr=:0
>>> >>>>>> while. 1 do.
>>> >>>>>> ctr=:>:ctr
>>> >>>>>> if. y < ctr do.
>>> >>>>>> smoutput 'done'
>>> >>>>>> return.
>>> >>>>>> end.
>>> >>>>>> end.
>>> >>>>>> )
>>> >>>>>>
>>> >>>>>> 6!:2 'loop 100e6'
>>> >>>>>> 122.48
>>> >>>>>>
>>> >>>>>> Clearly explicit isn't the way to go
>>> >>>>>>
>>> >>>>>> Back to the tacit:
>>> >>>>>>
>>> >>>>>> log=:smoutput bind ]
>>> >>>>>>
>>> >>>>>>       ([: <: [: 0&{:: ] ; log )^:] 10
>>> >>>>>> 10
>>> >>>>>> 9
>>> >>>>>> 8
>>> >>>>>> 7
>>> >>>>>> 6
>>> >>>>>> 5
>>> >>>>>> 4
>>> >>>>>> 3
>>> >>>>>> 2
>>> >>>>>> 1
>>> >>>>>> 0
>>> >>>>>>
>>> >>>>>> I would replace log with my function that uses the counter value
>>> >>>>>>
>>> >>>>>> That's pretty slow though on 100 million iterations - 88 seconds
>>> >>>>>>
>>> >>>>>> 6!:2 '([: <: [: 0&{:: ] ; ] )^:] 100e6'
>>> >>>>>> 88.3644
>>> >>>>>>
>>> >>>>>> Let's try a new log
>>> >>>>>>
>>> >>>>>>      log=: ] <:@:[ (smoutput bind ])
>>> >>>>>>       log 55
>>> >>>>>> 55
>>> >>>>>> 54
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> That looks like it could work. Let's put a dummy in for now
>>> >>>>>>
>>> >>>>>>      log=: ] <:@:[ ]
>>> >>>>>>
>>> >>>>>> 6!:2 'log^:] 100e6'
>>> >>>>>> 40.9542
>>> >>>>>>
>>> >>>>>> Still 40 seconds... Any ideas on how to speed up iteration like
>>> this?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Sidenote: I'm looping due to memory constraints placed on the coding
>>> >>>>>> challenge
>>> >>>>>>
>>> ----------------------------------------------------------------------
>>> >>>>>> For information about J forums see
>>> http://www.jsoftware.com/forums.htm
>>> >>>>>>
>>> >>>>> --
>>> >>>>> Robert Bernecky
>>> >>>>> Snake Island Research Inc
>>> >>>>> 18 Fifth Street
>>> >>>>> Ward's Island
>>> >>>>> Toronto, Ontario M5J 2B9
>>> >>>>>
>>> >>>>> [email protected]
>>> >>>>> tel: +1 416 203 0854
>>> >>>>>
>>> >>>
>>> >>> --
>>> >>> Robert Bernecky
>>> >>> Snake Island Research Inc
>>> >>> 18 Fifth Street
>>> >>> Ward's Island
>>> >>> Toronto, Ontario M5J 2B9
>>> >>>
>>> >>> [email protected]
>>> >>> tel: +1 416 203 0854
>>> >>>
>>> >
>>> >
>>> > --
>>> > Robert Bernecky
>>> > Snake Island Research Inc
>>> > 18 Fifth Street
>>> > Ward's Island
>>> > Toronto, Ontario M5J 2B9
>>> >
>>> > [email protected]
>>> > tel: +1 416 203 0854
>>> >
>>> > ----------------------------------------------------------------------
>>> > For information about J forums see http://www.jsoftware.com/forums.htm
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>
>>
>>
>>
>> --
>> Devon McCormick, CFA
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to