You can define a file (in effect, one very long string) as a mapped file that looks like a two-dimensional matrix, and from then on treat it as an ordinary variable.
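A minimal sketch of the "long string seen as a matrix" idea, using a small in-memory string in place of a real file. The names str and mat and the sizes are made up for illustration, and no actual file mapping is done here; with a mapped file the characters would stay on disk rather than in a J noun.

```j
NB. Illustrative only: a short string stands in for the file contents.
str =: 50 $ 'abcdefghij'      NB. a 50-character string (hypothetical data)
mat =: 5 10 $ str             NB. the same characters viewed as 5 rows of 10
$ mat                         NB. shape of the matrix: 5 10
2 { mat                       NB. row 2, i.e. one fixed-width record: abcdefghij
```

Each row then behaves like a fixed-width record that can be indexed directly.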
So a file holding a 500,000,000-character string can be defined as a 50,000,000 by 10 matrix.

-
Björn Helgason
gsm:6985532
skype:gosiminn

On 10.4.2014 16:19, "Raul Miller" <[email protected]> wrote:

> If by "table" you mean something that has columns of equal length and
> names for each column? Then it will already be a representation of a
> table and of course it's possible to write code which extracts
> different tables from it.
>
> If by "table" you mean a rank 2 array of boxes? I expect my "table" to
> be far too large to fit all of it in memory, and my experience so far
> is that representing this data as a rank 2 array of boxes will result
> in performance which is too slow to tolerate, by at least an order of
> magnitude.
>
> I'm right now running almost 300 computers, each either running J (or
> running some other code preparing the data to run J - I need to
> extract the files from an archive before I can parse them, among other
> things), and there's enough to do that I do not expect to be done this
> month. I'm told that I have about 10 terabytes of data to process, but
> so far I have access to less than half of that - I'm waiting for the
> rest to be uploaded.
>
> Think about what that means for a moment.
>
> Now... have I answered your question?
>
> If I have not adequately answered your question, please be more
> specific about what you are asking.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Apr 10, 2014 at 9:19 AM, Linda Alvord <[email protected]> wrote:
> > In the "real" string database you are designing, will you be able to
> > extract tables of data from it also?
> >
> > Linda
> >
> > -----Original Message-----
> > Sent: Thursday, April 10, 2014 12:03 AM
> > To: Programming forum
> > Subject: Re: [Jprogramming] "Segmented Strings"
> >
> > I do not understand your question.
> >
> > Could you uncompress it a little?
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Wed, Apr 9, 2014 at 11:59 PM, Linda Alvord <[email protected]> wrote:
> >> Can you still extract tables from it rather than strings?
> >>
> >> Linda
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Raul Miller
> >> Sent: Wednesday, April 09, 2014 9:47 PM
> >> To: Programming forum
> >> Subject: Re: [Jprogramming] "Segmented Strings"
> >>
> >> Oh, I see how you were thinking.
> >>
> >> Actually, the code was secondary - it was only meant to illustrate the
> >> structure of the data.
> >>
> >> In "real life", I will not be using that code to create the segmented
> >> strings. It'll be more involved.
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Wed, Apr 9, 2014 at 9:43 PM, Linda Alvord <[email protected]> wrote:
> >>> Your example FirstName=:;LF&,each }.0{"1 table is a string creation.
> >>>
> >>> Mine ]FN2=: >"0 }.0{"1 table is a table.
> >>>
> >>> If you create tables of character data and tables of the numeric data
> >>> separately, you could transform the numeric data and then join columns to
> >>> columns or rows to rows.
> >>>
> >>> More dimensions could be created as well and then joined in ways to
> >>> summarize the useful data and finally rejoin the results.
> >>>
> >>> My suggestion is really only related to giving thought to how best to
> >>> extract and use the string table you have created.
> >>>
> >>> Linda
> >>>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:[email protected]] On Behalf Of Raul Miller
> >>> Sent: Wednesday, April 09, 2014 10:06 AM
> >>> To: Programming forum
> >>> Subject: Re: [Jprogramming] "Segmented Strings"
> >>>
> >>> How?
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Raul
> >>>
> >>>
> >>> On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:
> >>>
> >>>> I would not get rid of your table made of strings.
> >>>> I would access it in the form of J tables because that is what J does nicely.
> >>>>
> >>>> Linda
> >>>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]] On Behalf Of Raul Miller
> >>>> Sent: Wednesday, April 09, 2014 2:48 AM
> >>>> To: Programming forum
> >>>> Subject: Re: [Jprogramming] "Segmented Strings"
> >>>>
> >>>> The plan is that segmented strings are the data in the database.
> >>>>
> >>>> There's just too much information to hold it all in memory on a single
> >>>> machine.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Raul
> >>>>
> >>>>
> >>>> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
> >>>>
> >>>> > I know almost nothing about large databases, but what is the advantage of
> >>>> > staying with segmented strings after the database is built?
> >>>> >
> >>>> > Once you have your table, or maybe two or more tables of character and
> >>>> > numeric data, you might "stay in J" and make "subtables" which can be
> >>>> > catenated together and destroyed as needed. You could also do selections
> >>>> > of subsets more easily.
> >>>> >
> >>>> >    ]FirstName=:;LF&,each }.0{"1 table
> >>>> >
> >>>> > Adam
> >>>> > Travis
> >>>> > Donald
> >>>> > Gary
> >>>> > James
> >>>> > Sam
> >>>> > Travis
> >>>> > Adam
> >>>> > Walter
> >>>> >
> >>>> >    ]FN2=: >"0 }.0{"1 table
> >>>> > Adam
> >>>> > Travis
> >>>> > Donald
> >>>> > Gary
> >>>> > James
> >>>> > Sam
> >>>> > Travis
> >>>> > Adam
> >>>> > Walter
> >>>> >
> >>>> >    FN2-:FirstName
> >>>> > 0
> >>>> >    $FirstName
> >>>> > 53
> >>>> >    $FN2
> >>>> > 9 6
> >>>> >
> >>>> > Linda
> >>>> >
> >>>> >
> >>>> > -----Original Message-----
> >>>> > From: [email protected] [mailto:[email protected]] On Behalf Of Raul Miller
> >>>> > Sent: Tuesday, April 08, 2014 8:22 PM
> >>>> > To: Programming forum
> >>>> > Subject: Re: [Jprogramming] "Segmented Strings"
> >>>> >
> >>>> > I might indeed do that, but in some cases the time to read the file itself
> >>>> > will be mostly network transfer time. And, once it's in memory, how it got
> >>>> > there isn't really an issue.
> >>>> >
> >>>> > Still, it's worth benchmarking.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > --
> >>>> > Raul
> >>>> >
> >>>> >
> >>>> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
> >>>> >
> >>>> > > I second memory mapped files and mapped file database.
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
> >>>> > >
> >>>> > > > It's available for free now, with some limitations:
> >>>> > > >
> >>>> > > > http://kx.com/software-download.php
> >>>> > > >
> >>>> > > > It'll take me a few years, though, to develop a fluency in K (Q actually,
> >>>> > > > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
> >>>> > > > not at all clear that K (or Q or KDB+) would be any better for this
> >>>> > > > application than J.
> >>>> > > > The grass is always greener on the other side of the
> >>>> > > > fence, especially after you've crossed it?
> >>>> > > >
> >>>> > > > Also, if I do my job properly, the language itself becomes irrelevant and
> >>>> > > > the data structures are straightforward enough to allow any arbitrary
> >>>> > > > language to be used.
> >>>> > > >
> >>>> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> >>>> > > >
> >>>> > > > Thanks,
> >>>> > > >
> >>>> > > > --
> >>>> > > > Raul
> >>>> > > >
> >>>> > > >
> >>>> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> >>>> > > >
> >>>> > > > > I think I would pay for k's database capability. --Kip Murray
> >>>> > > > >
> >>>> > > > > Sent from my iPad
> >>>> > > > >
> >>>> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> >>>> > > > > >
> >>>> > > > > > I would take a look at the mapped file database lab to get ideas.
> >>>> > > > > >
> >>>> > > > > > -
> >>>> > > > > > Björn Helgason
> >>>> > > > > > gsm:6985532
> >>>> > > > > > skype:gosiminn
> >>>> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> >>>> > > > > >>
> >>>> > > > > >> I have thought about using symbols, but the only way to delete symbols that
> >>>> > > > > >> I know of involves exiting J. And, my starting premise was that I would
> >>>> > > > > >> have too much data to fit into memory.
> >>>> > > > > >>
> >>>> > > > > >> For some computations it does make sense to start up an independent J
> >>>> > > > > >> session for each part of the calculation (and, in fact, that is what I am
> >>>> > > > > >> doing in a different aspect of dealing with this dataset - it's about 10
> >>>> > > > > >> terabytes, or so I am told - I've not actually seen it all yet and it takes
> >>>> > > > > >> time to upload it). But for some calculations you need to be able to
> >>>> > > > > >> correlate between pieces which have been dealt with elsewhere.
> >>>> > > > > >>
> >>>> > > > > >> I have similar reservations about fixed-width fields. There's just too much
> >>>> > > > > >> data for me to predict how wide the fields are going to be. In some cases I
> >>>> > > > > >> might actually be going with fixed-width, but that might be too inefficient
> >>>> > > > > >> for the general case. I've one field which would have to be over 100k in
> >>>> > > > > >> width if it was fixed width, even though typical cases are shorter than 1k.
> >>>> > > > > >> At some point I might go with fixed width, and I expect that doing so will
> >>>> > > > > >> cause me to lose a few records which will be discovered later in
> >>>> > > > > >> processing. That might not be a big deal, for this large of a data set, but
> >>>> > > > > >> if it's not necessary why bother?
> >>>> > > > > >>
> >>>> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a good
> >>>> > > > > >> idea, at least for the character data.
> >>>> > > > > >> But that is an optimization and
> >>>> > > > > >> optimizations speed up some operations at the expense of slowing down other
> >>>> > > > > >> operations. So what really matters is the workload.
> >>>> > > > > >>
> >>>> > > > > >> Ultimately, for a dataset this large, it's going to take time.
> >>>> > > > > >>
> >>>> > > > > >> Thanks,
> >>>> > > > > >>
> >>>> > > > > >> --
> >>>> > > > > >> Raul
> >>>> > > > > >>
> >>>> > > > > >>
> >>>> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> >>>> > > > > >>>
> >>>> > > > > >>> It seems this representation is somewhat similar to how the symbol table
> >>>> > > > > >>> stores strings:
> >>>> > > > > >>>
> >>>> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> >>>> > > > > >>>
> >>>> > > > > >>> Also, did you consider using symbols? I've used symbols for string columns
> >>>> > > > > >>> that contain highly repetitive data, for example, an invoice table with an
> >>>> > > > > >>> alpha-numeric SKU.
> >>>> > > > > >>>
> >>>> > > > > >>> Thanks for sharing
> >>>> > > > > >>>
> >>>> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> >>>> > > > > >>>
> >>>> > > > > >>>> Consider this example:
> >>>> > > > > >>>>
> >>>> > > > > >>>> table=:<;._2;._2]0 :0
> >>>> > > > > >>>> First Name,Last Name,Sum,
> >>>> > > > > >>>> Adam,Wallace,19,
> >>>> > > > > >>>> Travis,Smith,10,
> >>>> > > > > >>>> Donald,Barnell,8,
> >>>> > > > > >>>> Gary,Wallace,27,
> >>>> > > > > >>>> James,Smith,10,
> >>>> > > > > >>>> Sam,Johnson,10,
> >>>> > > > > >>>> Travis,Neal,11,
> >>>> > > > > >>>> Adam,Campbell,11,
> >>>> > > > > >>>> Walter,Abbott,13,
> >>>> > > > > >>>> )
> >>>> > > > > >>>>
> >>>> > > > > >>>> Using boxed strings works great for relatively small sets of data. But when
> >>>> > > > > >>>> things get big, their overhead starts to hurt too much. (Big means: so much
> >>>> > > > > >>>> data that you'll probably not be able to fit it all in memory at the same
> >>>> > > > > >>>> time. So you need to plan on relatively frequent delays while reading from
> >>>> > > > > >>>> disk.)
> >>>> > > > > >>>>
> >>>> > > > > >>>> One alternative to boxed strings is segmented strings. A segmented string
> >>>> > > > > >>>> is an argument which could be passed to <;._1. It's basically just a string
> >>>> > > > > >>>> with a prefix delimiter. You can work with these sorts of strings directly,
> >>>> > > > > >>>> and achieve results similar to what you would achieve with boxed arrays.
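A minimal sketch of the definition just quoted, using a made-up three-field string. The name seg is illustrative and is not from the original post; the only assumption is that LF is the prefix delimiter, as in the rest of the thread.

```j
NB. A small segmented string: every field is preceded by the delimiter LF.
seg =: LF,'Adam',LF,'Travis',LF,'Donald'

<;._1 seg          NB. cut on the leading delimiter: boxes holding Adam, Travis, Donald
#;._1 seg          NB. field lengths without building any boxes: 4 6 6
(<;._1 seg) -: 'Adam';'Travis';'Donald'    NB. 1: same content as the boxed list
```

The second expression shows the appeal: a verb can be applied per field directly on the flat string, so the boxed form never has to exist.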
> >>>> > > > > >>>>
> >>>> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> >>>> > > > > >>>> the integrity checks, so if you mess up you probably will not see an error.
> >>>> > > > > >>>> So it's probably a good idea to model your code using boxed arrays on a
> >>>> > > > > >>>> small set of data and then convert to segmented representation once you're
> >>>> > > > > >>>> happy with how things work (and once you see a time cost that makes it
> >>>> > > > > >>>> worth spending the time to rework your code).
> >>>> > > > > >>>>
> >>>> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> >>>> > > > > >>>> do an initial pass on the data, to extract its structure.
> >>>> > > > > >>>>
> >>>> > > > > >>>> Here's an example:
> >>>> > > > > >>>>
> >>>> > > > > >>>>    FirstName=:;LF&,each }.0{"1 table
> >>>> > > > > >>>>    LastName=:;LF&,each }.1{"1 table
> >>>> > > > > >>>>    Sum=:;LF&,each }.2{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>> > > > > >>>>
> >>>> > > > > >>>>    FirstNameDir=: ssdir FirstName
> >>>> > > > > >>>>    LastNameDir=: ssdir LastName
> >>>> > > > > >>>>
> >>>> > > > > >>>> Actually, sum is numeric so let's just use a numeric representation for
> >>>> > > > > >>>> that column:
> >>>> > > > > >>>>
> >>>> > > > > >>>>    Sum=: _&".@> }.2{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>> Which rows have a last name of Smith?
> >>>> > > > > >>>>
> >>>> > > > > >>>>    <:({.LastNameDir) I. I.'Smith' E. LastName
> >>>> > > > > >>>> 1 4
> >>>> > > > > >>>>
> >>>> > > > > >>>> Actually, there's an assumption there that Smith is not part of some larger
> >>>> > > > > >>>> name. We can include the delimiter in the search if we are concerned about
> >>>> > > > > >>>> that. For even more protection we could append a trailing delimiter on our
> >>>> > > > > >>>> segmented string and then search for (in this case) LF,'Smith',LF.
> >>>> > > > > >>>>
> >>>> > > > > >>>> Anyways, let's extract the corresponding sums and first names:
> >>>> > > > > >>>>
> >>>> > > > > >>>>    1 4{Sum
> >>>> > > > > >>>> 10 10
> >>>> > > > > >>>>
> >>>> > > > > >>>>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> >>>> > > > > >>>>
> >>>> > > > > >>>> Travis
> >>>> > > > > >>>> James
> >>>> > > > > >>>>
> >>>> > > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
> >>>> > > > > >>>> though, if what you are extracting is a small part of the total. And, in
> >>>> > > > > >>>> that case, using a list of indices to express a boolean result seems like a
> >>>> > > > > >>>> good thing. You wind up working with set operations (intersection and
> >>>> > > > > >>>> union) rather than logical operations (and and or). Also, set difference
> >>>> > > > > >>>> instead of logical not (dyadic -. instead of monadic -.).
> >>>> > > > > >>>>
> >>>> > > > > >>>>    intersect=: [ -. -.
> >>>> > > > > >>>>    union=: ~.@,
> >>>> > > > > >>>>
> >>>> > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
> >>>> > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
> >>>> > > > > >>>>
> >>>> > > > > >>>> Thanks,
> >>>> > > > > >>>>
> >>>> > > > > >>>> --
> >>>> > > > > >>>> Raul
> >>>> > > > > >>>> ----------------------------------------------------------------------
> >>>> > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
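The original post in this thread mentions that searching for 'Smith' alone assumes Smith is not part of a larger name, and suggests searching for LF,'Smith',LF against the string with a trailing delimiter appended. A minimal sketch of that suggestion follows. The four-name data, and the names starts and hits, are made up for illustration ('Smithson' is included only to provoke the false match); ssdir is copied from the post.

```j
NB. Hypothetical data: 'Smithson' is present only to show the false match.
LastName =: ; LF&,each 'Wallace';'Smith';'Barnell';'Smithson'
ssdir =: [:(}:,:2-~/\])I.@(= {.),#     NB. directory verb from the original post
starts =: {. ssdir LastName              NB. start offset of each segment: 0 8 14 22

NB. Unanchored search, as in the post: also hits the Smith inside Smithson.
<: starts I. I. 'Smith' E. LastName      NB. 1 3

NB. Anchored search: delimiter on both sides, trailing delimiter appended.
hits =: I. (LF,'Smith',LF) E. LastName,LF
starts i. hits                           NB. 1
```

Because an anchored match begins exactly on a segment's leading delimiter, the match offsets can be looked up in the start offsets with i. and no interval search is needed.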
