Re: [Jprogramming] "Segmented Strings"

Björn Helgason Thu, 10 Apr 2014 11:56:57 -0700

You can define a file - a very long string - as a mapped file that looks
like a two-dimensional matrix, and from then on treat it as a variable.

So a file holding a string 500.000.000 characters long can be defined to
be a 50.000.000 by 10 matrix.
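
A minimal sketch of the idea, scaled down to 30 characters. It assumes the standard jmf script, and that map_jmf_ takes the type and trailing shape as its left argument (that is how I remember it from the mapped files lab, so check the lab if it complains). The file name and the noun name rows are made up for illustration.

```j
require 'jmf'                     NB. mapped-file utilities

fn=: jpath '~temp/rows.dat'       NB. hypothetical scratch file
(30 $ 'abcdefghij') fwrite fn     NB. put 30 characters on disk

NB. assumed calling form: (type;trailing shape) map_jmf_ name;filename
(JCHAR;10) map_jmf_ 'rows';fn

$rows                             NB. should report 3 10 if the file is taken as rows of 10
1 { rows                          NB. second row, read through the mapping
unmap_jmf_ 'rows'                 NB. release the mapping when done
```

Once mapped, rows is used like any other character matrix, while the data itself stays in the file.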

-
Björn Helgason
gsm:6985532
skype:gosiminn
On 10.4.2014 16:19, "Raul Miller" <[email protected]> wrote:

> If by "table" you mean something that has columns of equal length and
> names for each column, then it will already be a representation of a
> table, and of course it's possible to write code which extracts
> different tables from it.
>
> If by "table" you mean a rank 2 array of boxes? I expect my "table" to
> be far too large to fit all of it in memory, and my experience so far
> is that representing this data as a rank 2 array of boxes will result
> in performance which is too slow to tolerate, by at least an order of
> magnitude.
>
> I'm right now running almost 300 computers, each either running J (or
> running some other code preparing the data to run J - I need to
> extract the files from an archive before I can parse them, among other
> things), and there's enough to do that I do not expect to be done this
> month. I'm told that I have about 10 terabytes of data to process, but
> so far I have access to less than half of that - I'm waiting for the
> rest to be uploaded.
>
> Think about what that means for a moment.
>
> Now... have I answered your question?
>
> If I have not adequately answered your question, please be more
> specific about what you are asking.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Apr 10, 2014 at 9:19 AM, Linda Alvord <[email protected]>
> wrote:
> > In the "real" string database you are designing, will you be able to
> > extract tables of data from it also?
> >
> > Linda
> >
> > -----Original Message-----
> > Sent: Thursday, April 10, 2014 12:03 AM
> > To: Programming forum
> > Subject: Re: [Jprogramming] "Segmented Strings"
> >
> > I do not understand your question.
> >
> > Could you uncompress it a little?
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Wed, Apr 9, 2014 at 11:59 PM, Linda Alvord <[email protected]>
> wrote:
> >> Can you still extract tables from it rather than strings?
> >>
> >> Linda
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:
> [email protected]] On Behalf Of Raul Miller
> >> Sent: Wednesday, April 09, 2014 9:47 PM
> >> To: Programming forum
> >> Subject: Re: [Jprogramming] "Segmented Strings"
> >>
> >> Oh, I see how you were thinking.
> >>
> >> Actually, the code was secondary - it was only meant to illustrate the
> >> structure of the data.
> >>
> >> In "real life", I will not be using that code to create the segmented
> >> strings. It'll be more involved.
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Wed, Apr 9, 2014 at 9:43 PM, Linda Alvord <[email protected]>
> wrote:
> >>> Your example  FirstName=:;LF&,each }.0{"1 table  is a string creation.
> >>>
> >>> Mine  ]FN2=:   >"0 }.0{"1 table  is a table.
> >>>
> >>> If you create tables of character data and tables of the numeric data
> >>> separately, you could transform the numeric data and then join columns
> >>> to columns or rows to rows.
> >>>
> >>> More dimensions could be created as well and then joined in ways to
> >>> summarize the useful data and finally rejoin the results.
> >>>
> >>> My suggestion is really only related to giving thought to how best to
> >>> extract and use the string table you have created.
> >>>
> >>> Linda
> >>>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:
> [email protected]] On Behalf Of Raul Miller
> >>> Sent: Wednesday, April 09, 2014 10:06 AM
> >>> To: Programming forum
> >>> Subject: Re: [Jprogramming] "Segmented Strings"
> >>>
> >>> How?
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Raul
> >>>
> >>>
> >>>
> >>> On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]
> >wrote:
> >>>
> >>>> I would not get rid of your table made of strings.  I would access it
> >>>> in the form of J tables because that is what J does nicely.
> >>>>
> >>>> Linda
> >>>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:
> >>>> [email protected]] On Behalf Of Raul Miller
> >>>> Sent: Wednesday, April 09, 2014 2:48 AM
> >>>> To: Programming forum
> >>>> Subject: Re: [Jprogramming] "Segmented Strings"
> >>>>
> >>>> The plan is that segmented strings are the data in the database.
> >>>>
> >>>> There's just too much information to hold it all in memory on a single
> >>>> machine.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Raul
> >>>>
> >>>>
> >>>> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]
> >>>> >wrote:
> >>>>
> >>>> > I know almost nothing about large databases, but what is the
> >>>> > advantage of staying with segmented strings after the database is
> >>>> > built?
> >>>> >
> >>>> > Once you have your table, or maybe two or more tables of character
> >>>> > and numeric data, you might stay in J and make "subtables" which can
> >>>> > be catenated together and destroyed as needed.  You could also do
> >>>> > selections of subsets more easily.
> >>>> >
> >>>> >   ]FirstName=:;LF&,each }.0{"1 table
> >>>> >
> >>>> > Adam
> >>>> > Travis
> >>>> > Donald
> >>>> > Gary
> >>>> > James
> >>>> > Sam
> >>>> > Travis
> >>>> > Adam
> >>>> > Walter
> >>>> >
> >>>> >    ]FN2=:   >"0 }.0{"1 table
> >>>> > Adam
> >>>> > Travis
> >>>> > Donald
> >>>> > Gary
> >>>> > James
> >>>> > Sam
> >>>> > Travis
> >>>> > Adam
> >>>> > Walter
> >>>> >
> >>>> >    FN2-:FirstName
> >>>> > 0
> >>>> >    $FirstName
> >>>> > 53
> >>>> >    $FN2
> >>>> > 9 6
> >>>> >
> >>>> > Linda
> >>>> >
> >>>> >
> >>>> > -----Original Message-----
> >>>> > From: [email protected] [mailto:
> >>>> > [email protected]] On Behalf Of Raul Miller
> >>>> > Sent: Tuesday, April 08, 2014 8:22 PM
> >>>> > To: Programming forum
> >>>> > Subject: Re: [Jprogramming] "Segmented Strings"
> >>>> >
> >>>> > I might indeed do that, but in some cases the time to read the file
> >>>> > itself will be mostly network transfer time. And, once it's in
> >>>> > memory, how it got there isn't really an issue.
> >>>> >
> >>>> > Still, it's worth benchmarking.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > --
> >>>> > Raul
> >>>> >
> >>>> >
> >>>> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]>
> >>>> wrote:
> >>>> >
> >>>> > > I second memory mapped files and mapped file database.
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <
> [email protected]>
> >>>> > wrote:
> >>>> > >
> >>>> > > > It's available for free now, with some limitations:
> >>>> > > >
> >>>> > > > http://kx.com/software-download.php
> >>>> > > >
> >>>> > > > It'll take me a few years, though, to develop a fluency in K (Q
> >>>> > > > actually, or kdb+ ...) which approaches my fluency in other
> >>>> > > > languages. Anyways, it's not at all clear that K (or Q or kdb+)
> >>>> > > > would be any better for this application than J. The grass is
> >>>> > > > always greener on the other side of the fence, especially after
> >>>> > > > you've crossed it?
> >>>> > > >
> >>>> > > > Also, if I do my job properly, the language itself becomes
> >>>> > > > irrelevant and the data structures are straightforward enough to
> >>>> > > > allow any arbitrary language to be used.
> >>>> > > >
> >>>> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> >>>> > > >
> >>>> > > > Thanks,
> >>>> > > >
> >>>> > > > --
> >>>> > > > Raul
> >>>> > > >
> >>>> > > >
> >>>> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> >>>> > > >
> >>>> > > > > I think I would pay for k's database capability.  --Kip Murray
> >>>> > > > >
> >>>> > > > > Sent from my iPad
> >>>> > > > >
> >>>> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <
> [email protected]>
> >>>> > > wrote:
> >>>> > > > > >
> >>>> > > > > > I would take a look at the mapped file database lab to get
> >>>> > > > > > ideas.
> >>>> > > > > >
> >>>> > > > > > -
> >>>> > > > > > Björn Helgason
> >>>> > > > > > gsm:6985532
> >>>> > > > > > skype:gosiminn
> >>>> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]>
> wrote:
> >>>> > > > > >>
> >>>> > > > > >> I have thought about using symbols, but the only way to
> >>>> > > > > >> delete symbols that I know of involves exiting J. And, my
> >>>> > > > > >> starting premise was that I would have too much data to fit
> >>>> > > > > >> into memory.
> >>>> > > > > >>
> >>>> > > > > >> For some computations it does make sense to start up an
> >>>> > > > > >> independent J session for each part of the calculation (and,
> >>>> > > > > >> in fact, that is what I am doing in a different aspect of
> >>>> > > > > >> dealing with this dataset - it's about 10 terabytes, or so I
> >>>> > > > > >> am told - I've not actually seen it all yet and it takes time
> >>>> > > > > >> to upload it). But for some calculations you need to be able
> >>>> > > > > >> to correlate between pieces which have been dealt with
> >>>> > > > > >> elsewhere.
> >>>> > > > > >>
> >>>> > > > > >> I have similar reservations about fixed-width fields. There's
> >>>> > > > > >> just too much data for me to predict how wide the fields are
> >>>> > > > > >> going to be. In some cases I might actually be going with
> >>>> > > > > >> fixed-width, but that might be too inefficient for the
> >>>> > > > > >> general case. I've one field which would have to be over 100k
> >>>> > > > > >> in width if it was fixed width, even though typical cases are
> >>>> > > > > >> shorter than 1k. At some point I might go with fixed width,
> >>>> > > > > >> and I expect that doing so will cause me to lose a few
> >>>> > > > > >> records which will be discovered later in processing. That
> >>>> > > > > >> might not be a big deal, for this large of a data set, but if
> >>>> > > > > >> it's not necessary why bother?
> >>>> > > > > >>
> >>>> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem
> >>>> > > > > >> like a good idea, at least for the character data. But that
> >>>> > > > > >> is an optimization and optimizations speed up some operations
> >>>> > > > > >> at the expense of slowing down other operations. So what
> >>>> > > > > >> really matters is the workload.
> >>>> > > > > >>
> >>>> > > > > >> Ultimately, for a dataset this large, it's going to take
> >>>> > > > > >> time.
> >>>> > > > > >>
> >>>> > > > > >> Thanks,
> >>>> > > > > >>
> >>>> > > > > >> --
> >>>> > > > > >> Raul
> >>>> > > > > >>
> >>>> > > > > >>
> >>>> > > > > >>
> >>>> > > > > >>
> >>>> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <
> >>>> [email protected]>
> >>>> > > > > wrote:
> >>>> > > > > >>>
> >>>> > > > > >>> It seems this representation is somewhat similar to how the
> >>>> > > > > >>> symbol table stores strings:
> >>>> > > > > >>>
> >>>> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> >>>> > > > > >>>
> >>>> > > > > >>> Also, did you consider using symbols? I've used symbols for
> >>>> > > > > >>> string columns that contain highly repetitive data, for
> >>>> > > > > >>> example, an invoice table with an alpha-numeric SKU.
> >>>> > > > > >>>
> >>>> > > > > >>> Thanks for sharing
> >>>> > > > > >>>
> >>>> > > > > >>>
> >>>> > > > > >>>
> >>>> > > > > >>>
> >>>> > > > > >>>
> >>>> > > > > >>>
> >>>> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <
> >>>> > [email protected]
> >>>> > > >
> >>>> > > > > >> wrote:
> >>>> > > > > >>>
> >>>> > > > > >>>> Consider this example:
> >>>> > > > > >>>>
> >>>> > > > > >>>> table=:<;._2;._2]0 :0
> >>>> > > > > >>>> First Name,Last Name,Sum,
> >>>> > > > > >>>> Adam,Wallace,19,
> >>>> > > > > >>>> Travis,Smith,10,
> >>>> > > > > >>>> Donald,Barnell,8,
> >>>> > > > > >>>> Gary,Wallace,27,
> >>>> > > > > >>>> James,Smith,10,
> >>>> > > > > >>>> Sam,Johnson,10,
> >>>> > > > > >>>> Travis,Neal,11,
> >>>> > > > > >>>> Adam,Campbell,11,
> >>>> > > > > >>>> Walter,Abbott,13,
> >>>> > > > > >>>> )
> >>>> > > > > >>>>
> >>>> > > > > >>>> Using boxed strings works great for relatively small sets
> >>>> > > > > >>>> of data. But when things get big, their overhead starts to
> >>>> > > > > >>>> hurt too much. (Big means: so much data that you'll
> >>>> > > > > >>>> probably not be able to fit it all in memory at the same
> >>>> > > > > >>>> time. So you need to plan on relatively frequent delays
> >>>> > > > > >>>> while reading from disk.)
> >>>> > > > > >>>>
> >>>> > > > > >>>> One alternative to boxed strings is segmented strings. A
> >>>> > > > > >>>> segmented string is an argument which could be passed to
> >>>> > > > > >>>> <;._1. It's basically just a string with a prefix
> >>>> > > > > >>>> delimiter. You can work with these sorts of strings
> >>>> > > > > >>>> directly, and achieve results similar to what you would
> >>>> > > > > >>>> achieve with boxed arrays.
> >>>> > > > > >>>>
> >>>> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays -
> >>>> > > > > >>>> you lose a lot of the integrity checks, so if you mess up
> >>>> > > > > >>>> you probably will not see an error. So it's probably a good
> >>>> > > > > >>>> idea to model your code using boxed arrays on a small set
> >>>> > > > > >>>> of data and then convert to segmented representation once
> >>>> > > > > >>>> you're happy with how things work (and once you see a time
> >>>> > > > > >>>> cost that makes it worth spending the time to rework your
> >>>> > > > > >>>> code).
> >>>> > > > > >>>>
> >>>> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every
> >>>> > > > > >>>> time, it's good to do an initial pass on the data, to
> >>>> > > > > >>>> extract its structure.
> >>>> > > > > >>>>
> >>>> > > > > >>>> Here's an example:
> >>>> > > > > >>>>
> >>>> > > > > >>>> FirstName=:;LF&,each }.0{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>> LastName=:;LF&,each }.1{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>> Sum=:;LF&,each }.2{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>> > > > > >>>>
> >>>> > > > > >>>> FirstNameDir=: ssdir FirstName
> >>>> > > > > >>>> LastNameDir=: ssdir LastName
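
To make the directory concrete, here is what ssdir produces for the first-name column. This is only a sketch: the column is rebuilt directly from a word list rather than from table, so the block runs on its own.

```j
NB. same first names as in the example table, as a segmented string
FirstName=: ;LF&,each ;:'Adam Travis Donald Gary James Sam Travis Adam Walter'

ssdir=: [:(}:,:2-~/\])I.@(= {.),#

ssdir FirstName
NB. row 0: offset of each segment (the position of its LF delimiter)
NB. row 1: length of each segment, delimiter included
NB. 0 5 12 19 24 30 34 41 46
NB. 5 7  7  5  6  4  7  5  7
```

The offsets come from locating every character equal to the first one (the delimiter), and the lengths are the pairwise differences once the total length is appended.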
> >>>> > > > > >>>>
> >>>> > > > > >>>> Actually, sum is numeric so let's just use a numeric
> >>>> > > > > >>>> representation for that column:
> >>>> > > > > >>>>
> >>>> > > > > >>>> Sum=: _&".@> }.2{"1 table
> >>>> > > > > >>>>
> >>>> > > > > >>>> Which rows have a last name of Smith?
> >>>> > > > > >>>>
> >>>> > > > > >>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
> >>>> > > > > >>>>
> >>>> > > > > >>>> 1 4
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> Actually, there's an assumption there that Smith is not
> >>>> > > > > >>>> part of some larger name. We can include the delimiter in
> >>>> > > > > >>>> the search if we are concerned about that. For even more
> >>>> > > > > >>>> protection we could append a trailing delimiter on our
> >>>> > > > > >>>> segmented string and then search for (in this case)
> >>>> > > > > >>>> LF,'Smith',LF.
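
A sketch of that stricter search, with the last-name column rebuilt from a word list so the block stands alone (hits is a name introduced here for illustration). One detail to watch: with the leading delimiter in the pattern, each hit lands exactly on a segment offset, so the decrement used in the earlier expression has to go.

```j
LastName=: ;LF&,each ;:'Wallace Smith Barnell Wallace Smith Johnson Neal Campbell Abbott'
ssdir=: [:(}:,:2-~/\])I.@(= {.),#
LastNameDir=: ssdir LastName

NB. whole-field match: the pattern carries both delimiters, and the string
NB. gets a trailing LF so that the last field can match too
hits=: I. (LF,'Smith',LF) E. LastName,LF

NB. hits coincide with segment offsets, so I. yields the row numbers directly
({.LastNameDir) I. hits
NB. 1 4
```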
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> Anyways, let's extract the corresponding sums and first
> >>>> > > > > >>>> names:
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>>   1 4{Sum
> >>>> > > > > >>>>
> >>>> > > > > >>>> 10 10
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> Travis
> >>>> > > > > >>>>
> >>>> > > > > >>>> James
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> Note that that last expression is a bit complicated. It's
> >>>> > > > > >>>> not so bad, though, if what you are extracting is a small
> >>>> > > > > >>>> part of the total. And, in that case, using a list of
> >>>> > > > > >>>> indices to express a boolean result seems like a good
> >>>> > > > > >>>> thing. You wind up working with set operations
> >>>> > > > > >>>> (intersection and union) rather than logical operations
> >>>> > > > > >>>> (and and or). Also, set difference instead of logical not
> >>>> > > > > >>>> (dyadic -. instead of monadic -.).
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> intersect=: [ -. -.
> >>>> > > > > >>>>
> >>>> > > > > >>>> union=: ~.@,
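
A short usage sketch for these set verbs on row-index lists. The three index lists are read off the example table by hand, and their names are made up here.

```j
intersect=: [ -. -.
union=: ~.@,

smith=: 1 4              NB. rows whose last name is Smith
sum10=: 1 4 5            NB. rows whose Sum is 10
travis=: 1 6             NB. rows whose first name is Travis

smith intersect travis   NB. Smith and Travis:         1
smith union travis       NB. Smith or Travis:          1 4 6
sum10 -. smith           NB. Sum is 10 but not Smith:  5
```

Each result is again a list of row indices, so it can be fed straight back into the directory lookup to pull out the matching segments.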
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> (It looks like I might be using this kind of thing really
> >>>> > > > > >>>> soon, so I thought I'd lay down my thoughts here and invite
> >>>> > > > > >>>> comment.)
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> Thanks,
> >>>> > > > > >>>>
> >>>> > > > > >>>>
> >>>> > > > > >>>> --
> >>>> > > > > >>>>
> >>>> > > > > >>>> Raul
> >>>> > > > > >>>>
> >>>> > > > > >>>> ----------------------------------------------------------------------
> >>>> > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
