It's very cool to develop routines that treat delimited files as data. This can make sense for large read-only reference data, or more generally for low-priority "discardable" data. An example might be a file that is generated by an "expensive" query process elsewhere and will simply be replaced the next time that process runs.
There are a few alternatives, though. The first thing to realize about most string data is that it is rarely changed, and acts as baggage that is "only useful" for lookup. What you are doing implicitly is assigning id numbers to each record, and the most important step is keeping "the sum" as a separate numeric data structure. The sum is the real data being maintained here, and usually all real data (as opposed to tracking/lookup baggage) is numeric.

A general approach I take is to separate rarely changing data from real data into their own groups, and to keep either one file per group or an inverted table of boxes (like one file per column, but with one box holding the entire content of each file, so that a single data structure includes the content of all the files). Binary flags, whether they change frequently or not, deserve their own group, if only for the space savings. With the text file approach, you could also consider fixed-length records, so as to more quickly retrieve a list of indexes (say, all names with sum > 10).

In terms of your specific data set, I'd suggest that you have no real query need for looking up all people named Smith, so another alternative is to keep the original data structure and just pull out sum as an additional "real data" track. Your data currently has no way to tell multiple James Smiths apart, so you already need id keys (which you probably have). As for lookups, searching by name would probably only happen when a "customer service" rep has someone on the phone who doesn't know his own id; you would then retrieve all records matching the name, so as to ask more questions and figure out which id is theirs. So there is no obvious benefit (from assumptions that may well be wrong) to splitting up the original data.

The reason I bring this up is that searching for or retrieving LF-cut items involves cutting the whole file first. If you are going to be retrieving several fields from the same record, it's better to cut 1 file than 3 (or more). ie.
parsing and cutting only the retrieved records is likely to be much more efficient.

----- Original Message -----
From: Raul Miller <[email protected]>
To: Programming forum <[email protected]>
Cc:
Sent: Tuesday, April 8, 2014 2:40:00 AM
Subject: [Jprogramming] "Segmented Strings"

Consider this example:

table=: <;._2;._2]0 :0
First Name,Last Name,Sum,
Adam,Wallace,19,
Travis,Smith,10,
Donald,Barnell,8,
Gary,Wallace,27,
James,Smith,10,
Sam,Johnson,10,
Travis,Neal,11,
Adam,Campbell,11,
Walter,Abbott,13,
)

Using boxed strings works great for relatively small sets of data. But when things get big, their overhead starts to hurt too much. (Big means: so much data that you'll probably not be able to fit it all in memory at the same time, so you need to plan on relatively frequent delays while reading from disk.)

One alternative to boxed strings is segmented strings. A segmented string is an argument which could be passed to <;._1. It's basically just a string with a prefix delimiter. You can work with these sorts of strings directly, and achieve results similar to what you would achieve with boxed arrays.

Segmented strings are a bit clumsier than boxed arrays - you lose a lot of the integrity checks, so if you mess up you probably will not see an error. So it's probably a good idea to model your code using boxed arrays on a small set of data, and then convert to the segmented representation once you're happy with how things work (and once you see a time cost that makes it worth spending the time to rework your code).

Also, to avoid having to use f;._2 (or whatever) every time, it's good to do an initial pass on the data to extract its structure.
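The core idea (one flat, prefix-delimited string plus a precomputed directory of segment positions, instead of many boxed strings) can be sketched outside J as well. Here is a rough Python analogue; all names (`ssdir`, `segment`, `DELIM`) are illustrative, not from the J code:

```python
# Sketch of the segmented-string idea: keep one flat string with a
# prefix delimiter before each segment, plus a directory of
# (start, length) pairs computed in a single initial pass.

DELIM = "\n"

def ssdir(seg):
    """One pass over a prefix-delimited string: return (start, length)
    for each segment, so later lookups need no re-scanning."""
    starts = [i + 1 for i, c in enumerate(seg) if c == DELIM]
    ends = starts[1:] + [len(seg) + 1]
    return [(s, e - s - 1) for s, e in zip(starts, ends)]

def segment(seg, directory, i):
    """Extract segment i using only the directory (no re-splitting)."""
    start, length = directory[i]
    return seg[start:start + length]

names = DELIM + DELIM.join(["Adam", "Travis", "Donald"])
d = ssdir(names)     # [(1, 4), (6, 6), (13, 6)]
```

As in the J version, the directory is built once up front, so repeated field retrievals avoid re-cutting the whole string.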
Here's an example:

FirstName=: ;LF&,each }.0{"1 table
LastName=: ;LF&,each }.1{"1 table
Sum=: ;LF&,each }.2{"1 table

ssdir=: [:(}:,:2-~/\])I.@(= {.),#

FirstNameDir=: ssdir FirstName
LastNameDir=: ssdir LastName

Actually, sum is numeric, so let's just use a numeric representation for that column:

Sum=: _&".@> }.2{"1 table

Which rows have a last name of Smith?

   <:({.LastNameDir) I. I.'Smith' E. LastName
1 4

Actually, there's an assumption there that Smith is not part of some larger name. We can include the delimiter in the search if we are concerned about that. For even more protection, we could append a trailing delimiter to our segmented string and then search for (in this case) LF,'Smith',LF.

Anyways, let's extract the corresponding sums and first names:

   1 4{Sum
10 10
   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
Travis
James

Note that that last expression is a bit complicated. It's not so bad, though, if what you are extracting is a small part of the total. And, in that case, using a list of indices to express a boolean result seems like a good thing. You wind up working with set operations (intersection and union) rather than logical operations (and and or). Also, set difference instead of logical not (dyadic -. instead of monadic -.):

intersect=: [ -. -.
union=: ~.@,

(It looks like I might be using this kind of thing really soon, so I thought I'd lay down my thoughts here and invite comment.)

Thanks,

-- 
Raul

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
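The query pattern above (find matching rows once, then reuse the resulting index list against every column, composing queries with set operations) can be mirrored in Python. This is a simplified sketch using plain lists rather than segmented strings; the helper names and the data below (taken from the example table) are illustrative:

```python
# Columns from the example table, as plain Python lists.
first_names = ["Adam", "Travis", "Donald", "Gary", "James",
               "Sam", "Travis", "Adam", "Walter"]
last_names = ["Wallace", "Smith", "Barnell", "Wallace", "Smith",
              "Johnson", "Neal", "Campbell", "Abbott"]
sums = [19, 10, 8, 27, 10, 10, 11, 11, 13]

def rows_matching(column, value):
    """Row indices where column equals value (exact match, so
    'Smith' cannot hit a larger name)."""
    return [i for i, v in enumerate(column) if v == value]

smith_rows = rows_matching(last_names, "Smith")       # [1, 4]
smith_sums = [sums[i] for i in smith_rows]            # [10, 10]
smith_first = [first_names[i] for i in smith_rows]    # ['Travis', 'James']

# Index lists compose with set operations, echoing the J definitions:
def intersect(a, b):          # analogue of intersect=: [ -. -.
    return [x for x in a if x in b]

def union(a, b):              # analogue of union=: ~.@, (catenate, nub)
    seen = []
    for x in a + b:
        if x not in seen:
            seen.append(x)
    return seen
```

Working with index lists like `smith_rows` means one search pays for retrievals from every column, which is the point of extracting the structure up front.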
