Re: [Jprogramming] "Segmented Strings"

km Tue, 08 Apr 2014 11:55:05 -0700

I think I would pay for k's database capability.  --Kip Murray

Sent from my iPad


> On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> 
> I would take a look at the mapped file database lab to get ideas.
> 
> -
> Björn Helgason
> gsm:6985532
> skype:gosiminn
>> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
>> 
>> I have thought about using symbols, but the only way to delete symbols that
>> I know of involves exiting J. And, my starting premise was that I would
>> have too much data to fit into memory.
>> 
>> For some computations it does make sense to start up an independent J
>> session for each part of the calculation (and, in fact, that is what I am
>> doing in a different aspect of dealing with this dataset - it's about 10
>> terabytes, or so I am told - I've not actually seen it all yet and it takes
>> time to upload it). But for some calculations you need to be able to
>> correlate between pieces which have been dealt with elsewhere.
>> 
>> I have similar reservations about fixed-width fields. There's just too much
>> data for me to predict how wide the fields are going to be. In some cases I
>> might actually be going with fixed-width, but that might be too inefficient
>> for the general case. I've one field which would have to be over 100k in
>> width if it was fixed width, even though typical cases are shorter than 1k.
>> At some point I might go with fixed width, and I expect that doing so will
>> cause me to lose a few records which will be discovered later in
>> processing. That might not be a big deal, for this large of a data set, but
>> if it's not necessary why bother?
>> 
>> Finally, Bjorn's suggestion of using mapped files does seem like a good
>> idea, at least for the character data. But that is an optimization and
>> optimizations speed up some operations at the expense of slowing down other
>> operations. So what really matters is the workload.
>> 
>> Ultimately, for a dataset this large, it's going to take time.
>> 
>> Thanks,
>> 
>> --
>> Raul
>> 
>> 
>> 
>> 
>>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>>> 
>>> It seems this representation is somewhat similar to how the symbol table
>>> stores strings:
>>> 
>>> http://m.jsoftware.com/help/dictionary/dsco.htm
>>> 
>>> Also, did you consider using symbols? I've used symbols for string columns
>>> that contain highly repetitive data, for example, an invoice table with an
>>> alpha-numeric SKU.
>>> 
>>> Thanks for sharing
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
>>> 
>>>> Consider this example:
>>>> 
>>>> table=:<;._2;._2]0 :0
>>>> First Name,Last Name,Sum,
>>>> Adam,Wallace,19,
>>>> Travis,Smith,10,
>>>> Donald,Barnell,8,
>>>> Gary,Wallace,27,
>>>> James,Smith,10,
>>>> Sam,Johnson,10,
>>>> Travis,Neal,11,
>>>> Adam,Campbell,11,
>>>> Walter,Abbott,13,
>>>> )
>>>> 
>>>> Using boxed strings works great for relatively small sets of data. But when
>>>> things get big, their overhead starts to hurt too much.  (Big means: so much
>>>> data that you'll probably not be able to fit it all in memory at the same
>>>> time. So you need to plan on relatively frequent delays while reading from
>>>> disk.)
>>>> 
>>>> One alternative to boxed strings is segmented strings. A segmented string
>>>> is an argument which could be passed to <;._1. It's basically just a string
>>>> with a prefix delimiter. You can work with these sorts of strings directly,
>>>> and achieve results similar to what you would achieve with boxed arrays.
>>>> 
>>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
>>>> the integrity checks, so if you mess up you probably will not see an error.
>>>> So it's probably a good idea to model your code using boxed arrays on a
>>>> small set of data and then convert to segmented representation once you're
>>>> happy with how things work (and once you see a time cost that makes it
>>>> worth spending the time to rework your code).
>>>> 
>>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
>>>> do an initial pass on the data, to extract its structure.
>>>> 
>>>> Here's an example:
>>>> 
>>>> FirstName=:;LF&,each }.0{"1 table
>>>> 
>>>> LastName=:;LF&,each }.1{"1 table
>>>> 
>>>> Sum=:;LF&,each }.2{"1 table
>>>> 
>>>> 
>>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>> 
>>>> FirstNameDir=: ssdir FirstName
>>>> LastNameDir=: ssdir LastName
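>>>> 
>>>> To see what ssdir produces, here is a tiny made-up string (comma as the
>>>> delimiter, not part of the table above). The first row holds the offset of
>>>> each delimiter and the second row holds each segment's length, with the
>>>> delimiter counted in that length:
>>>> 
>>>>    ssdir ',ab,cde,f'
>>>> 0 3 7
>>>> 3 4 2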
>>>> 
>>>> Actually, sum is numeric so let's just use a numeric representation for
>>>> that column
>>>> 
>>>> Sum=: _&".@> }.2{"1 table
>>>> 
>>>> Which rows have a last name of Smith?
>>>> 
>>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
>>>> 
>>>> 1 4
>>>> 
>>>> 
>>>> Actually, there's an assumption there that Smith is not part of some larger
>>>> name. We can include the delimiter in the search if we are concerned about
>>>> that. For even more protection we could append a trailing delimiter on our
>>>> segmented string and then search for (in this case) LF,'Smith',LF.
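>>>> 
>>>> A sketch of that stricter search, assuming the definitions above (the
>>>> name LastNameT is made up here). Because each match now begins on the
>>>> delimiter itself, the match offsets are exactly segment offsets, so a
>>>> plain lookup with i. replaces the interval search and the decrement:
>>>> 
>>>>    LastNameT=: LastName,LF          NB. trailing delimiter added
>>>>    ({.LastNameDir) i. I.(LF,'Smith',LF) E. LastNameT
>>>> 1 4
>>>> 
>>>> With this form a longer name that merely contains Smith (a hypothetical
>>>> Smithson, say) would not match.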
>>>> 
>>>> 
>>>> Anyways, let's extract the corresponding sums and first name:
>>>> 
>>>> 
>>>>   1 4{Sum
>>>> 
>>>> 10 10
>>>> 
>>>> 
>>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>>>> 
>>>> 
>>>> Travis
>>>> 
>>>> James
>>>> 
>>>> 
>>>> Note that that last expression is a bit complicated. It's not so bad,
>>>> though, if what you are extracting is a small part of the total. And, in
>>>> that case, using a list of indices to express a boolean result seems like a
>>>> good thing. You wind up working with set operations (intersection and
>>>> union) rather than logical operations (and and or). Also, set difference
>>>> instead of logical not (dyadic -. instead of monadic -.).
>>>> 
>>>> 
>>>> intersect=: [ -. -.
>>>> 
>>>> union=: ~.@,
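>>>> 
>>>> For instance, combining the Smith rows found earlier with the rows whose
>>>> Sum is 10 (the names smith and ten are made up for this sketch, and Sum is
>>>> the numeric version defined above):
>>>> 
>>>>    smith=: 1 4
>>>>    ten=: I. 10 = Sum
>>>>    smith intersect ten              NB. Smith and Sum of 10
>>>> 1 4
>>>>    smith union ten                  NB. Smith or Sum of 10
>>>> 1 4 5
>>>>    ten -. smith                     NB. Sum of 10 but not Smith
>>>> 5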
>>>> 
>>>> 
>>>> (It looks like I might be using this kind of thing really soon, so I
>>>> thought I'd lay down my thoughts here and invite comment.)
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> 
>>>> --
>>>> 
>>>> Raul
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> 
