In the "real" string database you are designing, will you also be able to extract tables of data from it?

Linda

-----Original Message-----
Sent: Thursday, April 10, 2014 12:03 AM
To: Programming forum
Subject: Re: [Jprogramming] "Segmented Strings"

I do not understand your question. Could you uncompress it a little?

Thanks,

--
Raul

On Wed, Apr 9, 2014 at 11:59 PM, Linda Alvord <[email protected]> wrote:
> Can you still extract tables from it rather than strings?
>
> Linda
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Raul Miller
> Sent: Wednesday, April 09, 2014 9:47 PM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> Oh, I see how you were thinking.
>
> Actually, the code was secondary - it was only meant to illustrate the
> structure of the data.
>
> In "real life", I will not be using that code to create the segmented
> strings. It'll be more involved.
>
> Thanks,
>
> --
> Raul
>
> On Wed, Apr 9, 2014 at 9:43 PM, Linda Alvord <[email protected]> wrote:
>> Your example FirstName=:;LF&,each }.0{"1 table is a string creation.
>>
>> Mine ]FN2=: >"0 }.0{"1 table is a table.
>>
>> If you create tables of character data and tables of the numeric data
>> separately, you could transform the numeric data and then join columns to
>> columns or rows to rows.
>>
>> More dimensions could be created as well and then joined in ways to
>> summarize the useful data and finally rejoin the results.
>>
>> My suggestion is really only related to giving thought to how best to
>> extract and use the string table you have created.
>>
>> Linda
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Raul Miller
>> Sent: Wednesday, April 09, 2014 10:06 AM
>> To: Programming forum
>> Subject: Re: [Jprogramming] "Segmented Strings"
>>
>> How?
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:
>>
>>> I would not get rid of your table made of strings. I would access it in
>>> the form of J tables because that is what J does nicely.
>>>
>>> Linda
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Raul Miller
>>> Sent: Wednesday, April 09, 2014 2:48 AM
>>> To: Programming forum
>>> Subject: Re: [Jprogramming] "Segmented Strings"
>>>
>>> The plan is that segmented strings are the data in the database.
>>>
>>> There's just too much information to hold it all in memory on a single
>>> machine.
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>>>
>>> > I know almost nothing about large databases, but what is the advantage of
>>> > staying with sstrings after the database is built?
>>> >
>>> > Once you have your table, or maybe two or more tables of character and
>>> > numeric data, you might "stay in J" and make "subtables" which can be
>>> > catenated together and destroyed as needed. You could also do selections
>>> > of subsets more easily.
>>> >
>>> >    ]FirstName=:;LF&,each }.0{"1 table
>>> >
>>> > Adam
>>> > Travis
>>> > Donald
>>> > Gary
>>> > James
>>> > Sam
>>> > Travis
>>> > Adam
>>> > Walter
>>> >
>>> >    ]FN2=: >"0 }.0{"1 table
>>> > Adam
>>> > Travis
>>> > Donald
>>> > Gary
>>> > James
>>> > Sam
>>> > Travis
>>> > Adam
>>> > Walter
>>> >
>>> >    FN2-:FirstName
>>> > 0
>>> >    $FirstName
>>> > 53
>>> >    $FN2
>>> > 9 6
>>> >
>>> > Linda
>>> >
>>> > -----Original Message-----
>>> > From: [email protected]
>>> > [mailto:[email protected]] On Behalf Of Raul Miller
>>> > Sent: Tuesday, April 08, 2014 8:22 PM
>>> > To: Programming forum
>>> > Subject: Re: [Jprogramming] "Segmented Strings"
>>> >
>>> > I might indeed do that, but in some cases the time to read the file itself
>>> > will be mostly network transfer time. And, once it's in memory, how it got
>>> > there isn't really an issue.
>>> >
>>> > Still, it's worth benchmarking.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Raul
>>> >
>>> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
>>> >
>>> > > I second memory mapped files and mapped file database.
>>> > >
>>> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
>>> > >
>>> > > > It's available for free now, with some limitations:
>>> > > >
>>> > > > http://kx.com/software-download.php
>>> > > >
>>> > > > It'll take me a few years, though, to develop a fluency in K (Q actually,
>>> > > > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
>>> > > > not at all clear that K (or Q or KDB+) would be any better for this
>>> > > > application than J. The grass is always greener on the other side of the
>>> > > > fence, especially after you've crossed it?
>>> > > >
>>> > > > Also, if I do my job properly, the language itself becomes irrelevant and
>>> > > > the data structures are straightforward enough to allow any arbitrary
>>> > > > language to be used.
>>> > > >
>>> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
>>> > > >
>>> > > > --
>>> > > > Raul
>>> > > >
>>> > > > Thanks,
>>> > > >
>>> > > > --
>>> > > > Raul
>>> > > >
>>> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
>>> > > >
>>> > > > > I think I would pay for k's database capability. --Kip Murray
>>> > > > >
>>> > > > > Sent from my iPad
>>> > > > >
>>> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
>>> > > > > >
>>> > > > > > I would take a look at the mapped file database lab to get ideas.
>>> > > > > >
>>> > > > > > -
>>> > > > > > Björn Helgason
>>> > > > > > gsm:6985532
>>> > > > > > skype:gosiminn
>>> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
>>> > > > > >>
>>> > > > > >> I have thought about using symbols, but the only way to delete symbols
>>> > > > > >> that I know of involves exiting J. And, my starting premise was that I
>>> > > > > >> would have too much data to fit into memory.
>>> > > > > >>
>>> > > > > >> For some computations it does make sense to start up an independent J
>>> > > > > >> session for each part of the calculation (and, in fact, that is what I
>>> > > > > >> am doing in a different aspect of dealing with this dataset - it's
>>> > > > > >> about 10 terabytes, or so I am told - I've not actually seen it all yet
>>> > > > > >> and it takes time to upload it). But for some calculations you need to
>>> > > > > >> be able to correlate between pieces which have been dealt with
>>> > > > > >> elsewhere.
>>> > > > > >>
>>> > > > > >> I have similar reservations about fixed-width fields. There's just too
>>> > > > > >> much data for me to predict how wide the fields are going to be. In
>>> > > > > >> some cases I might actually be going with fixed-width, but that might
>>> > > > > >> be too inefficient for the general case. I've one field which would
>>> > > > > >> have to be over 100k in width if it was fixed width, even though
>>> > > > > >> typical cases are shorter than 1k. At some point I might go with fixed
>>> > > > > >> width, and I expect that doing so will cause me to lose a few records
>>> > > > > >> which will be discovered later in processing. That might not be a big
>>> > > > > >> deal, for this large of a data set, but if it's not necessary why
>>> > > > > >> bother?
>>> > > > > >>
>>> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a good
>>> > > > > >> idea, at least for the character data. But that is an optimization and
>>> > > > > >> optimizations speed up some operations at the expense of slowing down
>>> > > > > >> other operations. So what really matters is the workload.
>>> > > > > >>
>>> > > > > >> Ultimately, for a dataset this large, it's going to take time.
>>> > > > > >>
>>> > > > > >> Thanks,
>>> > > > > >>
>>> > > > > >> --
>>> > > > > >> Raul
>>> > > > > >>
>>> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>>> > > > > >>>
>>> > > > > >>> It seems this representation is somewhat similar to how the symbol
>>> > > > > >>> table stores strings:
>>> > > > > >>>
>>> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
>>> > > > > >>>
>>> > > > > >>> Also, did you consider using symbols? I've used symbols for string
>>> > > > > >>> columns that contain highly repetitive data, for example, an invoice
>>> > > > > >>> table with an alpha-numeric SKU.
>>> > > > > >>>
>>> > > > > >>> Thanks for sharing
>>> > > > > >>>
>>> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
>>> > > > > >>>
>>> > > > > >>>> Consider this example:
>>> > > > > >>>>
>>> > > > > >>>>    table=:<;._2;._2]0 :0
>>> > > > > >>>> First Name,Last Name,Sum,
>>> > > > > >>>> Adam,Wallace,19,
>>> > > > > >>>> Travis,Smith,10,
>>> > > > > >>>> Donald,Barnell,8,
>>> > > > > >>>> Gary,Wallace,27,
>>> > > > > >>>> James,Smith,10,
>>> > > > > >>>> Sam,Johnson,10,
>>> > > > > >>>> Travis,Neal,11,
>>> > > > > >>>> Adam,Campbell,11,
>>> > > > > >>>> Walter,Abbott,13,
>>> > > > > >>>> )
>>> > > > > >>>>
>>> > > > > >>>> Using boxed strings works great for relatively small sets of data. But
>>> > > > > >>>> when things get big, their overhead starts to hurt too much. (Big
>>> > > > > >>>> means: so much data that you'll probably not be able to fit it all in
>>> > > > > >>>> memory at the same time. So you need to plan on relatively frequent
>>> > > > > >>>> delays while reading from disk.)
>>> > > > > >>>>
>>> > > > > >>>> One alternative to boxed strings is segmented strings. A segmented
>>> > > > > >>>> string is an argument which could be passed to <;._1. It's basically
>>> > > > > >>>> just a string with a prefix delimiter. You can work with these sorts
>>> > > > > >>>> of strings directly, and achieve results similar to what you would
>>> > > > > >>>> achieve with boxed arrays.
>>> > > > > >>>>
>>> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a
>>> > > > > >>>> lot of the integrity checks, so if you mess up you probably will not
>>> > > > > >>>> see an error. So it's probably a good idea to model your code using
>>> > > > > >>>> boxed arrays on a small set of data and then convert to segmented
>>> > > > > >>>> representation once you're happy with how things work (and once you
>>> > > > > >>>> see a time cost that makes it worth spending the time to rework your
>>> > > > > >>>> code).
>>> > > > > >>>>
>>> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good
>>> > > > > >>>> to do an initial pass on the data, to extract its structure.
>>> > > > > >>>>
>>> > > > > >>>> Here's an example:
>>> > > > > >>>>
>>> > > > > >>>>    FirstName=:;LF&,each }.0{"1 table
>>> > > > > >>>>    LastName=:;LF&,each }.1{"1 table
>>> > > > > >>>>    Sum=:;LF&,each }.2{"1 table
>>> > > > > >>>>
>>> > > > > >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>> > > > > >>>>
>>> > > > > >>>>    FirstNameDir=: ssdir FirstName
>>> > > > > >>>>    LastNameDir=: ssdir LastName
>>> > > > > >>>>
>>> > > > > >>>> Actually, sum is numeric so let's just use a numeric representation
>>> > > > > >>>> for that column:
>>> > > > > >>>>
>>> > > > > >>>>    Sum=: _&".@> }.2{"1 table
>>> > > > > >>>>
>>> > > > > >>>> Which rows have a last name of Smith?
>>> > > > > >>>>
>>> > > > > >>>>    <:({.LastNameDir) I. I.'Smith' E. LastName
>>> > > > > >>>> 1 4
>>> > > > > >>>>
>>> > > > > >>>> Actually, there's an assumption there that Smith is not part of some
>>> > > > > >>>> larger name. We can include the delimiter in the search if we are
>>> > > > > >>>> concerned about that. For even more protection we could append a
>>> > > > > >>>> trailing delimiter on our segmented string and then search for (in
>>> > > > > >>>> this case) LF,'Smith',LF.
>>> > > > > >>>>
>>> > > > > >>>> Anyways, let's extract the corresponding sums and first names:
>>> > > > > >>>>
>>> > > > > >>>>    1 4{Sum
>>> > > > > >>>> 10 10
>>> > > > > >>>>
>>> > > > > >>>>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>>> > > > > >>>>
>>> > > > > >>>> Travis
>>> > > > > >>>> James
>>> > > > > >>>>
>>> > > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
>>> > > > > >>>> though, if what you are extracting is a small part of the total. And,
>>> > > > > >>>> in that case, using a list of indices to express a boolean result
>>> > > > > >>>> seems like a good thing. You wind up working with set operations
>>> > > > > >>>> (intersection and union) rather than logical operations (and and or).
>>> > > > > >>>> Also, set difference instead of logical not (dyadic -. instead of
>>> > > > > >>>> monadic -.).
>>> > > > > >>>>
>>> > > > > >>>>    intersect=: [ -. -.
>>> > > > > >>>>    union=: ~.@,
>>> > > > > >>>>
>>> > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
>>> > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
>>> > > > > >>>>
>>> > > > > >>>> Thanks,
>>> > > > > >>>>
>>> > > > > >>>> --
>>> > > > > >>>> Raul
>>> > > > > >>>> ----------------------------------------------------------------------
>>> > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
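
For readers following the thread without a J session at hand, the directory idea from the original post (ssdir builds a table of segment starts and lengths, and rows are then located by searching the string) can be roughly sketched in Python. This is only a transliteration under assumptions, not the code from the thread: the names ss_make, ss_dir, ss_find, ss_get and ss_table are hypothetical, the search uses the delimiter on both sides as suggested in the post, and the last helper shows one possible way to get a padded table back out of a segmented string.

```python
from bisect import bisect_left

# Hypothetical helper names; a sketch of the idea, not the thread's J code.

def ss_make(items, delim="\n"):
    """Join items into one string, prefixing each item with the delimiter."""
    return "".join(delim + item for item in items)

def ss_dir(s):
    """Return (starts, lengths) per segment; each length counts the delimiter."""
    delim = s[0]  # assumes a non-empty segmented string
    marks = [i for i, c in enumerate(s) if c == delim] + [len(s)]
    starts = marks[:-1]
    lengths = [b - a for a, b in zip(marks, marks[1:])]
    return starts, lengths

def ss_find(s, starts, word):
    """Row numbers whose whole segment equals word."""
    delim = s[0]
    needle = delim + word + delim
    hay = s + delim  # trailing delimiter so the last row can match too
    rows = []
    pos = hay.find(needle)
    while pos != -1:
        rows.append(bisect_left(starts, pos))  # pos is itself a segment start
        pos = hay.find(needle, pos + 1)
    return rows

def ss_get(s, directory, rows):
    """Extract the selected segments as plain strings (delimiter dropped)."""
    starts, lengths = directory
    return [s[starts[r] + 1 : starts[r] + lengths[r]] for r in rows]

def ss_table(s, directory):
    """Pad every segment to the widest one, giving a rectangular table."""
    items = ss_get(s, directory, range(len(directory[0])))
    width = max(len(item) for item in items)
    return [item.ljust(width) for item in items]

first = ss_make(["Adam", "Travis", "Donald", "Gary", "James",
                 "Sam", "Travis", "Adam", "Walter"])
last = ss_make(["Wallace", "Smith", "Barnell", "Wallace", "Smith",
                "Johnson", "Neal", "Campbell", "Abbott"])
first_dir = ss_dir(first)
last_dir = ss_dir(last)

rows = ss_find(last, last_dir[0], "Smith")
print(rows)                            # [1, 4]
print(ss_get(first, first_dir, rows))  # ['Travis', 'James']
print(len(first))                      # 53
```

The directory is computed once per column and reused, so a lookup costs one substring search plus a binary search per hit. ss_table(first, first_dir) gives 9 rows of width 6, the same shape as FN2 in the thread, which suggests that tables can still be rebuilt from the strings when needed.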
