It's very cool to develop routines that treat delimited files as data. This can make sense for large read-only reference data, or more generally for low-priority "discardable" data. An example might be a file that is generated by an "expensive" query process elsewhere and will simply be replaced the next time that process runs.
There are a few alternatives, though. The first thing to realize about most string data is that it is rarely changed, and acts as baggage that is "only useful" for lookup. What you are doing implicitly is assigning id numbers to each record, and the most important step is keeping "the sum" as a separate numeric data structure. The sum is the real data being maintained here, and usually all real data (as opposed to tracking/lookup baggage) is numeric.

A general approach I take is to separate rarely changing data from real data into their own groups, and to keep either one file per group or an inverted table of boxes (like one file per column, but with one box holding the entire content of each file, so that a single data structure includes the content of all the files). Binary flags, whether they change frequently or not, deserve their own group, if only for the space savings. With the text file approach, you could also consider fixed-length records, so as to more quickly retrieve a list of indexes (say, all names with sum > 10).

In terms of your specific data set, I'd suggest that you have no real query need for looking up all people named Smith, so another alternative is to keep the original data structure and just pull out sum as an additional "real data" track. Your data currently has no way to tell multiple James Smiths apart, so you already need id keys (which you probably have). As for lookups, searching by name would probably only happen when a "customer service" rep has someone on the phone who doesn't know his own id; you would then retrieve all records matching the name, so as to ask more questions and figure out which id is theirs. So there is no obvious benefit (from assumptions that may well be wrong) to splitting up the original data.

The reason I bring this up is that searching for or retrieving LF-cut items involves cutting the whole file first. If you are going to be retrieving several fields from the same record, it's better to cut 1 file than 3 (or more). ie.
parsing and cutting only the retrieved records is likely to be much more efficient.

----- Original Message -----
From: Raul Miller <[email protected]>
To: Programming forum <[email protected]>
Cc:
Sent: Tuesday, April 8, 2014 2:40:00 AM
Subject: [Jprogramming] "Segmented Strings"

Consider this example:

table=: <;._2;._2]0 :0
First Name,Last Name,Sum,
Adam,Wallace,19,
Travis,Smith,10,
Donald,Barnell,8,
Gary,Wallace,27,
James,Smith,10,
Sam,Johnson,10,
Travis,Neal,11,
Adam,Campbell,11,
Walter,Abbott,13,
)

Using boxed strings works great for relatively small sets of data. But when things get big, their overhead starts to hurt too much. (Big means: so much data that you'll probably not be able to fit it all in memory at the same time, so you need to plan on relatively frequent delays while reading from disk.)

One alternative to boxed strings is segmented strings. A segmented string is an argument which could be passed to <;._1. It's basically just a string with a prefix delimiter. You can work with these sorts of strings directly, and achieve results similar to what you would achieve with boxed arrays.

Segmented strings are a bit clumsier than boxed arrays - you lose a lot of the integrity checks, so if you mess up you probably will not see an error. So it's probably a good idea to model your code using boxed arrays on a small set of data, and then convert to the segmented representation once you're happy with how things work (and once you see a time cost that makes it worth spending the time to rework your code).

Also, to avoid having to use f;._2 (or whatever) every time, it's good to do an initial pass on the data to extract its structure.
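The core idea (one flat, prefix-delimited string plus a precomputed directory of segment positions, instead of many boxed strings) can be sketched outside J as well. Here is a rough Python analogue; all names (`ssdir`, `segment`, `DELIM`) are illustrative, not from the J code:

```python
# Sketch of the segmented-string idea: keep one flat string with a
# prefix delimiter before each segment, plus a directory of
# (start, length) pairs computed in a single initial pass.

DELIM = "\n"

def ssdir(seg):
    """One pass over a prefix-delimited string: return (start, length)
    for each segment, so later lookups need no re-scanning."""
    starts = [i + 1 for i, c in enumerate(seg) if c == DELIM]
    ends = starts[1:] + [len(seg) + 1]
    return [(s, e - s - 1) for s, e in zip(starts, ends)]

def segment(seg, directory, i):
    """Extract segment i using only the directory (no re-splitting)."""
    start, length = directory[i]
    return seg[start:start + length]

names = DELIM + DELIM.join(["Adam", "Travis", "Donald"])
d = ssdir(names)     # [(1, 4), (6, 6), (13, 6)]
```

As in the J version, the directory is built once up front, so repeated field retrievals avoid re-cutting the whole string.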
Here's an example:

FirstName=: ;LF&,each }.0{"1 table
LastName=: ;LF&,each }.1{"1 table
Sum=: ;LF&,each }.2{"1 table

ssdir=: [:(}:,:2-~/\])I.@(= {.),#

FirstNameDir=: ssdir FirstName
LastNameDir=: ssdir LastName

Actually, sum is numeric, so let's just use a numeric representation for that column:

Sum=: _&".@> }.2{"1 table

Which rows have a last name of Smith?

   <:({.LastNameDir) I. I.'Smith' E. LastName
1 4

Actually, there's an assumption there that Smith is not part of some larger name. We can include the delimiter in the search if we are concerned about that. For even more protection, we could append a trailing delimiter to our segmented string and then search for (in this case) LF,'Smith',LF.

Anyways, let's extract the corresponding sums and first names:

   1 4{Sum
10 10
   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
Travis
James

Note that that last expression is a bit complicated. It's not so bad, though, if what you are extracting is a small part of the total. And, in that case, using a list of indices to express a boolean result seems like a good thing. You wind up working with set operations (intersection and union) rather than logical operations (and and or). Also, set difference instead of logical not (dyadic -. instead of monadic -.):

intersect=: [ -. -.
union=: ~.@,

(It looks like I might be using this kind of thing really soon, so I thought I'd lay down my thoughts here and invite comment.)

Thanks,

-- 
Raul

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
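The query pattern above (find matching rows once, then reuse the resulting index list against every column, composing queries with set operations) can be mirrored in Python. This is a simplified sketch using plain lists rather than segmented strings; the helper names and the data below (taken from the example table) are illustrative:

```python
# Columns from the example table, as plain Python lists.
first_names = ["Adam", "Travis", "Donald", "Gary", "James",
               "Sam", "Travis", "Adam", "Walter"]
last_names = ["Wallace", "Smith", "Barnell", "Wallace", "Smith",
              "Johnson", "Neal", "Campbell", "Abbott"]
sums = [19, 10, 8, 27, 10, 10, 11, 11, 13]

def rows_matching(column, value):
    """Row indices where column equals value (exact match, so
    'Smith' cannot hit a larger name)."""
    return [i for i, v in enumerate(column) if v == value]

smith_rows = rows_matching(last_names, "Smith")       # [1, 4]
smith_sums = [sums[i] for i in smith_rows]            # [10, 10]
smith_first = [first_names[i] for i in smith_rows]    # ['Travis', 'James']

# Index lists compose with set operations, echoing the J definitions:
def intersect(a, b):          # analogue of intersect=: [ -. -.
    return [x for x in a if x in b]

def union(a, b):              # analogue of union=: ~.@, (catenate, nub)
    seen = []
    for x in a + b:
        if x not in seen:
            seen.append(x)
    return seen
```

Working with index lists like `smith_rows` means one search pays for retrievals from every column, which is the point of extracting the structure up front.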
