I think I would pay for k's database capability.

--Kip Murray

Sent from my iPad

> On Apr 8, 2014, at 12:46 PM, Björn Helgason wrote:
>
> I would take a look at the mapped file database lab to get ideas.
>
> -
> Björn Helgason
> gsm:6985532
> skype:gosiminn
>> On 8.4.2014 15:34, "Raul Miller" wrote:
>>
>> I have thought about using symbols, but the only way to delete symbols that
>> I know of involves exiting J. And, my starting premise was that I would
>> have too much data to fit into memory.
>>
>> For some computations it does make sense to start up an independent J
>> session for each part of the calculation (and, in fact, that is what I am
>> doing in a different aspect of dealing with this dataset - it's about 10
>> terabytes, or so I am told - I've not actually seen it all yet and it takes
>> time to upload it). But for some calculations you need to be able to
>> correlate between pieces which have been dealt with elsewhere.
>>
>> I have similar reservations about fixed-width fields. There's just too much
>> data for me to predict how wide the fields are going to be. In some cases I
>> might actually be going with fixed-width, but that might be too inefficient
>> for the general case. I've one field which would have to be over 100k in
>> width if it was fixed width, even though typical cases are shorter than 1k.
>> At some point I might go with fixed width, and I expect that doing so will
>> cause me to lose a few records which will be discovered later in
>> processing. That might not be a big deal for a data set this large, but
>> if it's not necessary why bother?
>>
>> Finally, Bjorn's suggestion of using mapped files does seem like a good
>> idea, at least for the character data. But that is an optimization and
>> optimizations speed up some operations at the expense of slowing down other
>> operations. So what really matters is the workload.
>>
>> Ultimately, for a dataset this large, it's going to take time.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner wrote:
>>>
>>> It seems this representation is somewhat similar to how the symbol table
>>> stores strings:
>>>
>>> http://m.jsoftware.com/help/dictionary/dsco.htm
>>>
>>> Also, did you consider using symbols? I've used symbols for string columns
>>> that contain highly repetitive data, for example, an invoice table with an
>>> alpha-numeric SKU.
>>>
>>> Thanks for sharing
>>>
>>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller wrote:
>>>
>>>> Consider this example:
>>>>
>>>> table=:<;._2;._2]0 :0
>>>> First Name,Last Name,Sum,
>>>> Adam,Wallace,19,
>>>> Travis,Smith,10,
>>>> Donald,Barnell,8,
>>>> Gary,Wallace,27,
>>>> James,Smith,10,
>>>> Sam,Johnson,10,
>>>> Travis,Neal,11,
>>>> Adam,Campbell,11,
>>>> Walter,Abbott,13,
>>>> )
>>>>
>>>> Using boxed strings works great for relatively small sets of data. But when
>>>> things get big, their overhead starts to hurt too much. (Big means: so much
>>>> data that you'll probably not be able to fit it all in memory at the same
>>>> time. So you need to plan on relatively frequent delays while reading from
>>>> disk.)
>>>>
>>>> One alternative to boxed strings is segmented strings. A segmented string
>>>> is an argument which could be passed to <;._1. It's basically just a string
>>>> with a prefix delimiter. You can work with these sorts of strings directly,
>>>> and achieve results similar to what you would achieve with boxed arrays.
>>>>
>>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
>>>> the integrity checks, so if you mess up you probably will not see an error.
>>>> So it's probably a good idea to model your code using boxed arrays on a
>>>> small set of data and then convert to segmented representation once you're
>>>> happy with how things work (and once you see a time cost that makes it
>>>> worth spending the time to rework your code).
>>>>
>>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
>>>> do an initial pass on the data, to extract its structure.
>>>>
>>>> Here's an example:
>>>>
>>>> FirstName=:;LF&,each }.0{"1 table
>>>> LastName=:;LF&,each }.1{"1 table
>>>> Sum=:;LF&,each }.2{"1 table
>>>>
>>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>>
>>>> FirstNameDir=: ssdir FirstName
>>>> LastNameDir=: ssdir LastName
>>>>
>>>> Actually, sum is numeric so let's just use a numeric representation for
>>>> that column
>>>>
>>>> Sum=: _&".@> }.2{"1 table
>>>>
>>>> Which rows have a last name of Smith?
>>>>
>>>> <:({.LastNameDir) I. I.'Smith' E. LastName
>>>> 1 4
>>>>
>>>> Actually, there's an assumption there that Smith is not part of some larger
>>>> name. We can include the delimiter in the search if we are concerned about
>>>> that. For even more protection we could append a trailing delimiter on our
>>>> segmented string and then search for (in this case) LF,'Smith',LF.
>>>>
>>>> Anyways, let's extract the corresponding sums and first name:
>>>>
>>>> 1 4{Sum
>>>> 10 10
>>>>
>>>> FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>>>>
>>>> Travis
>>>>
>>>> James
>>>>
>>>> Note that that last expression is a bit complicated. It's not so bad,
>>>> though, if what you are extracting is a small part of the total. And, in
>>>> that case, using a list of indices to express a boolean result seems like a
>>>> good thing. You wind up working with set operations (intersection and
>>>> union) rather than logical operations (and and or). Also, set difference
>>>> instead of logical not (dyadic -. instead of monadic -.).
>>>>
>>>> intersect=: [ -. -.
>>>> union=: ~.@,
>>>>
>>>> (It looks like I might be using this kind of thing really soon, so I
>>>> thought I'd lay down my thoughts here and invite comment.)
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
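
For what it's worth, here is a minimal sketch of the delimiter-bounded search suggested in the original message (append a trailing delimiter, then search for LF,'Smith',LF), combined with the index-list set operations. The name ssmatch is hypothetical and not from the thread. The sketch assumes ssdir, LastName, intersect and the numeric version of Sum are defined as in the quoted message, and that the first character of a segmented string is its delimiter.

```j
NB. Sketch only: ssmatch is a hypothetical helper, not from the thread.
NB. x is a field value; y is a segmented string whose first character is its delimiter.
ssmatch=: 4 : 0
  d=. {. y                    NB. prefix delimiter (LF in the example)
  hits=. I. (d,x,d) E. y,d    NB. starts of matches bounded by a delimiter on both sides
  ({. ssdir y) i. hits        NB. each hit is a segment start, so look up its row number
)

smith=: 'Smith' ssmatch LastName   NB. 1 4, the same rows as the I.-based search
(I. Sum = 10) intersect smith      NB. rows that are Smith and also have a Sum of 10
```

Because every hit begins on a delimiter, each hit is exactly a segment start, so a plain lookup with i. in the first row of the directory is enough and no decrement is needed. The trade-off is that y,d copies the whole string, which may matter at the data sizes discussed here.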
