If by "table" you mean something that has columns of equal length and a
name for each column, then it will already be a representation of a table,
and of course it's possible to write code which extracts different tables
from it.
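To make that first reading concrete, here is a minimal sketch, not the code used on the real data. It assumes two tiny made-up columns stored in the segmented form used earlier in this thread (LF as the prefix delimiter); the names boxcol and small are invented for this illustration.

```j
NB. Hypothetical columns in segmented form: each value is preceded by LF.
FirstName=: LF,'Adam',LF,'Travis',LF,'Donald'
LastName=: LF,'Wallace',LF,'Smith',LF,'Barnell'

NB. Cut on the leading delimiter: one box per row.
boxcol=: <;._1

NB. Stitch the boxed columns into a small rank 2 array of boxes.
small=: (boxcol FirstName) ,. boxcol LastName
$ small            NB. 3 2

NB. Or open one column into a character table (rows padded with blanks).
$ > boxcol FirstName   NB. 3 6
```

The point of the sketch is only that a small table can be cut out of the segmented columns on demand, so the full dataset never has to exist as boxes.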
If by "table" you mean a rank 2 array of boxes: I expect my "table" to be
far too large to fit all of it in memory, and my experience so far is that
representing this data as a rank 2 array of boxes will result in
performance which is too slow to tolerate, by at least an order of
magnitude.

I'm right now running almost 300 computers, each running either J or some
other code which prepares the data for J (I need to extract the files from
an archive before I can parse them, among other things), and there's enough
to do that I do not expect to be done this month. I'm told that I have
about 10 terabytes of data to process, but so far I have access to less
than half of that - I'm waiting for the rest to be uploaded.

Think about what that means for a moment.

Now... have I answered your question?

If I have not adequately answered your question, please be more specific
about what you are asking.

Thanks,

--
Raul

On Thu, Apr 10, 2014 at 9:19 AM, Linda Alvord <[email protected]> wrote:
> In the "real" string database you are designing, will you be able to extract
> tables of data from it also?
>
> Linda
>
> -----Original Message-----
> Sent: Thursday, April 10, 2014 12:03 AM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> I do not understand your question.
>
> Could you uncompress it a little?
>
> Thanks,
>
> --
> Raul
>
> On Wed, Apr 9, 2014 at 11:59 PM, Linda Alvord <[email protected]> wrote:
>> Can you still extract tables from it rather than strings?
>>
>> Linda
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Raul Miller
>> Sent: Wednesday, April 09, 2014 9:47 PM
>> To: Programming forum
>> Subject: Re: [Jprogramming] "Segmented Strings"
>>
>> Oh, I see how you were thinking.
>>
>> Actually, the code was secondary - it was only meant to illustrate the
>> structure of the data.
>>
>> In "real life", I will not be using that code to create the segmented
>> strings.
>> It'll be more involved.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Wed, Apr 9, 2014 at 9:43 PM, Linda Alvord <[email protected]> wrote:
>>> Your example
>>>    FirstName=:;LF&,each }.0{"1 table
>>> is a string creation.
>>>
>>> Mine
>>>    ]FN2=: >"0 }.0{"1 table
>>> is a table.
>>>
>>> If you create tables of character data and tables of the numeric data
>>> separately, you could transform the numeric data and then join columns to
>>> columns or rows to rows.
>>>
>>> More dimensions could be created as well and then joined in ways to
>>> summarize the useful data and finally rejoin the results.
>>>
>>> My suggestion is really only related to giving thought to how best to
>>> extract and use the string table you have created.
>>>
>>> Linda
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Raul Miller
>>> Sent: Wednesday, April 09, 2014 10:06 AM
>>> To: Programming forum
>>> Subject: Re: [Jprogramming] "Segmented Strings"
>>>
>>> How?
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>> On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:
>>>
>>>> I would not get rid of your table made of strings. I would access it in
>>>> the form of J tables because that is what J does nicely.
>>>>
>>>> Linda
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:
>>>> [email protected]] On Behalf Of Raul Miller
>>>> Sent: Wednesday, April 09, 2014 2:48 AM
>>>> To: Programming forum
>>>> Subject: Re: [Jprogramming] "Segmented Strings"
>>>>
>>>> The plan is that segmented strings are the data in the database.
>>>>
>>>> There's just too much information to hold it all in memory on a single
>>>> machine.
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>>>>
>>>> > I know almost nothing about large databases, but what is the advantage of
>>>> > staying with sstrings after the database is built?
>>>> >
>>>> > Once you have your table, or maybe two or more tables of character and
>>>> > numeric data, you might "stay in J" and make "subtables" which can be
>>>> > catenated together and destroyed as needed. You could also do selections
>>>> > of subsets more easily.
>>>> >
>>>> >    ]FirstName=:;LF&,each }.0{"1 table
>>>> >
>>>> > Adam
>>>> > Travis
>>>> > Donald
>>>> > Gary
>>>> > James
>>>> > Sam
>>>> > Travis
>>>> > Adam
>>>> > Walter
>>>> >
>>>> >    ]FN2=: >"0 }.0{"1 table
>>>> > Adam
>>>> > Travis
>>>> > Donald
>>>> > Gary
>>>> > James
>>>> > Sam
>>>> > Travis
>>>> > Adam
>>>> > Walter
>>>> >
>>>> >    FN2-:FirstName
>>>> > 0
>>>> >    $FirstName
>>>> > 53
>>>> >    $FN2
>>>> > 9 6
>>>> >
>>>> > Linda
>>>> >
>>>> > -----Original Message-----
>>>> > From: [email protected] [mailto:
>>>> > [email protected]] On Behalf Of Raul Miller
>>>> > Sent: Tuesday, April 08, 2014 8:22 PM
>>>> > To: Programming forum
>>>> > Subject: Re: [Jprogramming] "Segmented Strings"
>>>> >
>>>> > I might indeed do that, but in some cases the time to read the file
>>>> > itself will be mostly network transfer time. And, once it's in memory,
>>>> > how it got there isn't really an issue.
>>>> >
>>>> > Still, it's worth benchmarking.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > --
>>>> > Raul
>>>> >
>>>> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
>>>> >
>>>> > > I second memory mapped files and mapped file database.
>>>> > >
>>>> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
>>>> > >
>>>> > > > It's available for free now, with some limitations:
>>>> > > >
>>>> > > > http://kx.com/software-download.php
>>>> > > >
>>>> > > > It'll take me a few years, though, to develop a fluency in K (Q
>>>> > > > actually, or kdb+ ...) which approaches my fluency in other
>>>> > > > languages. Anyways, it's not at all clear that K (or Q or KDB+)
>>>> > > > would be any better for this application than J.
>>>> > > > The grass is always greener on the other side of the fence,
>>>> > > > especially after you've crossed it?
>>>> > > >
>>>> > > > Also, if I do my job properly, the language itself becomes irrelevant
>>>> > > > and the data structures are straightforward enough to allow any
>>>> > > > arbitrary language to be used.
>>>> > > >
>>>> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
>>>> > > >
>>>> > > > Thanks,
>>>> > > >
>>>> > > > --
>>>> > > > Raul
>>>> > > >
>>>> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
>>>> > > >
>>>> > > > > I think I would pay for k's database capability. --Kip Murray
>>>> > > > >
>>>> > > > > Sent from my iPad
>>>> > > > >
>>>> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
>>>> > > > > >
>>>> > > > > > I would take a look at the mapped file database lab to get ideas.
>>>> > > > > >
>>>> > > > > > -
>>>> > > > > > Björn Helgason
>>>> > > > > > gsm:6985532
>>>> > > > > > skype:gosiminn
>>>> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
>>>> > > > > >>
>>>> > > > > >> I have thought about using symbols, but the only way to delete
>>>> > > > > >> symbols that I know of involves exiting J. And, my starting
>>>> > > > > >> premise was that I would have too much data to fit into memory.
>>>> > > > > >>
>>>> > > > > >> For some computations it does make sense to start up an
>>>> > > > > >> independent J session for each part of the calculation (and, in
>>>> > > > > >> fact, that is what I am doing in a different aspect of dealing
>>>> > > > > >> with this dataset - it's about 10 terabytes, or so I am told -
>>>> > > > > >> I've not actually seen it all yet and it takes time to upload it).
>>>> > > > > >> But for some calculations you need to be able to correlate
>>>> > > > > >> between pieces which have been dealt with elsewhere.
>>>> > > > > >>
>>>> > > > > >> I have similar reservations about fixed-width fields. There's
>>>> > > > > >> just too much data for me to predict how wide the fields are
>>>> > > > > >> going to be. In some cases I might actually be going with
>>>> > > > > >> fixed-width, but that might be too inefficient for the general
>>>> > > > > >> case. I've one field which would have to be over 100k in width
>>>> > > > > >> if it was fixed width, even though typical cases are shorter
>>>> > > > > >> than 1k. At some point I might go with fixed width, and I expect
>>>> > > > > >> that doing so will cause me to lose a few records which will be
>>>> > > > > >> discovered later in processing. That might not be a big deal,
>>>> > > > > >> for this large of a data set, but if it's not necessary why
>>>> > > > > >> bother?
>>>> > > > > >>
>>>> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like
>>>> > > > > >> a good idea, at least for the character data. But that is an
>>>> > > > > >> optimization and optimizations speed up some operations at the
>>>> > > > > >> expense of slowing down other operations. So what really matters
>>>> > > > > >> is the workload.
>>>> > > > > >>
>>>> > > > > >> Ultimately, for a dataset this large, it's going to take time.
>>>> > > > > >>
>>>> > > > > >> Thanks,
>>>> > > > > >>
>>>> > > > > >> --
>>>> > > > > >> Raul
>>>> > > > > >>
>>>> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>>>> > > > > >>>
>>>> > > > > >>> It seems this representation is somewhat similar to how the
>>>> > > > > >>> symbol table stores strings:
>>>> > > > > >>>
>>>> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
>>>> > > > > >>>
>>>> > > > > >>> Also, did you consider using symbols? I've used symbols for
>>>> > > > > >>> string columns that contain highly repetitive data, for example,
>>>> > > > > >>> an invoice table with an alpha-numeric SKU.
>>>> > > > > >>>
>>>> > > > > >>> Thanks for sharing
>>>> > > > > >>>
>>>> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
>>>> > > > > >>>
>>>> > > > > >>>> Consider this example:
>>>> > > > > >>>>
>>>> > > > > >>>> table=:<;._2;._2]0 :0
>>>> > > > > >>>> First Name,Last Name,Sum,
>>>> > > > > >>>> Adam,Wallace,19,
>>>> > > > > >>>> Travis,Smith,10,
>>>> > > > > >>>> Donald,Barnell,8,
>>>> > > > > >>>> Gary,Wallace,27,
>>>> > > > > >>>> James,Smith,10,
>>>> > > > > >>>> Sam,Johnson,10,
>>>> > > > > >>>> Travis,Neal,11,
>>>> > > > > >>>> Adam,Campbell,11,
>>>> > > > > >>>> Walter,Abbott,13,
>>>> > > > > >>>> )
>>>> > > > > >>>>
>>>> > > > > >>>> Using boxed strings works great for relatively small sets of
>>>> > > > > >>>> data. But when things get big, their overhead starts to hurt
>>>> > > > > >>>> too much. (Big means: so much data that you'll probably not be
>>>> > > > > >>>> able to fit it all in memory at the same time.
>>>> > > > > >>>> So you need to plan on relatively frequent delays while reading
>>>> > > > > >>>> from disk.)
>>>> > > > > >>>>
>>>> > > > > >>>> One alternative to boxed strings is segmented strings. A
>>>> > > > > >>>> segmented string is an argument which could be passed to <;._1.
>>>> > > > > >>>> It's basically just a string with a prefix delimiter. You can
>>>> > > > > >>>> work with these sorts of strings directly, and achieve results
>>>> > > > > >>>> similar to what you would achieve with boxed arrays.
>>>> > > > > >>>>
>>>> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you
>>>> > > > > >>>> lose a lot of the integrity checks, so if you mess up you
>>>> > > > > >>>> probably will not see an error. So it's probably a good idea to
>>>> > > > > >>>> model your code using boxed arrays on a small set of data and
>>>> > > > > >>>> then convert to segmented representation once you're happy with
>>>> > > > > >>>> how things work (and once you see a time cost that makes it
>>>> > > > > >>>> worth spending the time to rework your code).
>>>> > > > > >>>>
>>>> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time,
>>>> > > > > >>>> it's good to do an initial pass on the data, to extract its
>>>> > > > > >>>> structure.
>>>> > > > > >>>>
>>>> > > > > >>>> Here's an example:
>>>> > > > > >>>>
>>>> > > > > >>>>    FirstName=:;LF&,each }.0{"1 table
>>>> > > > > >>>>    LastName=:;LF&,each }.1{"1 table
>>>> > > > > >>>>    Sum=:;LF&,each }.2{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>> > > > > >>>>
>>>> > > > > >>>>    FirstNameDir=: ssdir FirstName
>>>> > > > > >>>>    LastNameDir=: ssdir LastName
>>>> > > > > >>>>
>>>> > > > > >>>> Actually, sum is numeric so let's just use a numeric
>>>> > > > > >>>> representation for that column
>>>> > > > > >>>>
>>>> > > > > >>>>    Sum=: _&".@> }.2{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>> Which rows have a last name of Smith?
>>>> > > > > >>>>
>>>> > > > > >>>>    <:({.LastNameDir) I. I.'Smith' E. LastName
>>>> > > > > >>>> 1 4
>>>> > > > > >>>>
>>>> > > > > >>>> Actually, there's an assumption there that Smith is not part of
>>>> > > > > >>>> some larger name. We can include the delimiter in the search if
>>>> > > > > >>>> we are concerned about that. For even more protection we could
>>>> > > > > >>>> append a trailing delimiter on our segmented string and then
>>>> > > > > >>>> search for (in this case) LF,'Smith',LF.
>>>> > > > > >>>>
>>>> > > > > >>>> Anyways, let's extract the corresponding sums and first name:
>>>> > > > > >>>>
>>>> > > > > >>>>    1 4{Sum
>>>> > > > > >>>> 10 10
>>>> > > > > >>>>
>>>> > > > > >>>>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>>>> > > > > >>>>
>>>> > > > > >>>> Travis
>>>> > > > > >>>> James
>>>> > > > > >>>>
>>>> > > > > >>>> Note that that last expression is a bit complicated.
>>>> > > > > >>>> It's not so bad, though, if what you are extracting is a small
>>>> > > > > >>>> part of the total. And, in that case, using a list of indices
>>>> > > > > >>>> to express a boolean result seems like a good thing. You wind
>>>> > > > > >>>> up working with set operations (intersection and union) rather
>>>> > > > > >>>> than logical operations (and and or). Also, set difference
>>>> > > > > >>>> instead of logical not (dyadic -. instead of monadic -.).
>>>> > > > > >>>>
>>>> > > > > >>>>    intersect=: [ -. -.
>>>> > > > > >>>>    union=: ~.@,
>>>> > > > > >>>>
>>>> > > > > >>>> (It looks like I might be using this kind of thing really soon,
>>>> > > > > >>>> so I thought I'd lay down my thoughts here and invite comment.)
>>>> > > > > >>>>
>>>> > > > > >>>> Thanks,
>>>> > > > > >>>>
>>>> > > > > >>>> --
>>>> > > > > >>>> Raul
>>>> > > > > >>>>
>>>> > > > > >>>> ----------------------------------------------------------------------
>>>> > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
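The original post quoted above suggests guarding the 'Smith' search by including delimiters on both sides. A minimal sketch of that idea follows, assuming a made-up column in which 'Smithson' has been added to provoke the false match; ssdir is copied from the original post, while the names naive and exact are invented here.

```j
NB. Hypothetical last-name column; 'Smithson' is made up to show the false match.
LastName=: LF,'Wallace',LF,'Smith',LF,'Smithson',LF,'Smith'

NB. ssdir as in the original post: row 0 = segment starts, row 1 = lengths.
ssdir=: [:(}:,:2-~/\])I.@(= {.),#
LastNameDir=: ssdir LastName

NB. Bare search: 'Smith' also matches inside 'Smithson' (row 2).
naive=: <:({.LastNameDir) I. I.'Smith' E. LastName                NB. 1 2 3

NB. Delimited search against a copy with a trailing LF.  Matches now begin
NB. on the delimiter itself, so the decrement (<:) is no longer needed.
exact=: ({.LastNameDir) I. I.(LF,'Smith',LF) E. LastName,LF       NB. 1 3
```

Note the dropped decrement: with the leading delimiter in the pattern, each hit lands exactly on a segment start, so the interval index already yields the row number.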
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
