Re: [Jprogramming] "Segmented Strings"

Raul Miller Thu, 10 Apr 2014 09:19:53 -0700

If by "table" you mean something that has columns of equal length and
a name for each column, then the data will already be a representation
of a table, and of course it's possible to write code which extracts
different tables from it.
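
A minimal sketch of that kind of extraction, assuming LF-prefixed
columns like the FirstName and LastName segmented strings quoted further
down in this thread. The names unseg and tbl are made up for this
illustration and are not code from the actual project:

```j
NB. Two small segmented strings: every field is prefixed by LF.
FirstName=: LF,'Adam',LF,'Travis',LF,'Donald'
LastName=:  LF,'Wallace',LF,'Smith',LF,'Barnell'

NB. unseg (hypothetical name): cut on the leading delimiter, box each field.
unseg=: <;._1

NB. Stitch the boxed columns side by side into a rank 2 array of boxes.
tbl=: (unseg FirstName) ,. unseg LastName
$ tbl      NB. 3 2
```

This only makes sense for a slice small enough to hold in memory; the
segmented strings themselves stay as they are.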


If by "table" you mean a rank 2 array of boxes, I expect my "table" to
be far too large to fit all of it in memory, and my experience so far
is that representing this data as a rank 2 array of boxes will result
in performance which is too slow to tolerate, by at least an order of
magnitude.

I'm right now running almost 300 computers, each either running J or
running some other code preparing the data for J (I need to extract
the files from an archive before I can parse them, among other
things), and there's enough to do that I do not expect to be done this
month. I'm told that I have about 10 terabytes of data to process, but
so far I have access to less than half of that - I'm waiting for the
rest to be uploaded.

Think about what that means for a moment.

Now... have I answered your question?

If I have not adequately answered your question, please be more
specific about what you are asking.

Thanks,

-- 
Raul


On Thu, Apr 10, 2014 at 9:19 AM, Linda Alvord <[email protected]> wrote:
> In the "real" string database you are designing, will you be able to extract
> tables of data from it also?
>
> Linda
>
> -----Original Message-----
> Sent: Thursday, April 10, 2014 12:03 AM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> I do not understand your question.
>
> Could you uncompress it a little?
>
> Thanks,
>
> --
> Raul
>
> On Wed, Apr 9, 2014 at 11:59 PM, Linda Alvord <[email protected]> wrote:
>> Can you still extract tables from it rather than strings?
>>
>> Linda
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Raul Miller
>> Sent: Wednesday, April 09, 2014 9:47 PM
>> To: Programming forum
>> Subject: Re: [Jprogramming] "Segmented Strings"
>>
>> Oh, I see how you were thinking.
>>
>> Actually, the code was secondary - it was only meant to illustrate the
>> structure of the data.
>>
>> In "real life", I will not be using that code to create the segmented
>> strings. It'll be more involved.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Wed, Apr 9, 2014 at 9:43 PM, Linda Alvord <[email protected]> wrote:
>>> Your example  FirstName=:;LF&,each }.0{"1 table  is a string creation.
>>>
>>> Mine  ]FN2=:   >"0 }.0{"1 table  is a table.
>>>
>>> If you create tables of character data and tables of the numeric data
>>> separately, you could transform the numeric data and then join columns to 
>>> columns or rows to rows.
>>>
>>> More dimensions could be created as well and then joined in ways to 
>>> summarize the useful data and finally rejoin the results.
>>>
>>> My suggestion is really only related to giving thought to how best to 
>>> extract and use the string table you have created.
>>>
>>> Linda
>>>
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Raul Miller
>>> Sent: Wednesday, April 09, 2014 10:06 AM
>>> To: Programming forum
>>> Subject: Re: [Jprogramming] "Segmented Strings"
>>>
>>> How?
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>>
>>>
>>> On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:
>>>
>>>> I would not get rid of your table made of strings.  I would access it in
>>>> the form of J tables because that is what J does nicely.
>>>>
>>>> Linda
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:
>>>> [email protected]] On Behalf Of Raul Miller
>>>> Sent: Wednesday, April 09, 2014 2:48 AM
>>>> To: Programming forum
>>>> Subject: Re: [Jprogramming] "Segmented Strings"
>>>>
>>>> The plan is that segmented strings are the data in the database.
>>>>
>>>> There's just too much information to hold it all in memory on a single
>>>> machine.
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>>
>>>> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>>>>
>>>> > I know almost nothing about large databases, but what is the advantage of
>>>> > staying with segmented strings after the database is built?
>>>> >
>>>> > Once you have your table, or maybe two or more tables of character and
>>>> > numeric data, you might "stay in J" and make "subtables" which can be
>>>> > catenated together and destroyed as needed.  You could also do selections
>>>> > of subsets more easily.
>>>> >
>>>> >   ]FirstName=:;LF&,each }.0{"1 table
>>>> >
>>>> > Adam
>>>> > Travis
>>>> > Donald
>>>> > Gary
>>>> > James
>>>> > Sam
>>>> > Travis
>>>> > Adam
>>>> > Walter
>>>> >
>>>> >    ]FN2=:   >"0 }.0{"1 table
>>>> > Adam
>>>> > Travis
>>>> > Donald
>>>> > Gary
>>>> > James
>>>> > Sam
>>>> > Travis
>>>> > Adam
>>>> > Walter
>>>> >
>>>> >    FN2-:FirstName
>>>> > 0
>>>> >    $FirstName
>>>> > 53
>>>> >    $FN2
>>>> > 9 6
>>>> >
>>>> > Linda
>>>> >
>>>> >
>>>> > -----Original Message-----
>>>> > From: [email protected] [mailto:
>>>> > [email protected]] On Behalf Of Raul Miller
>>>> > Sent: Tuesday, April 08, 2014 8:22 PM
>>>> > To: Programming forum
>>>> > Subject: Re: [Jprogramming] "Segmented Strings"
>>>> >
>>>> > I might indeed do that, but in some cases the time to read the file itself
>>>> > will be mostly network transfer time. And, once it's in memory, how it got
>>>> > there isn't really an issue.
>>>> >
>>>> > Still, it's worth benchmarking.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > --
>>>> > Raul
>>>> >
>>>> >
>>>> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
>>>> >
>>>> > > I second memory mapped files and mapped file database.
>>>> > >
>>>> > >
>>>> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
>>>> > >
>>>> > > > It's available for free now, with some limitations:
>>>> > > >
>>>> > > > http://kx.com/software-download.php
>>>> > > >
>>>> > > > It'll take me a few years, though, to develop a fluency in K (Q actually,
>>>> > > > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
>>>> > > > not at all clear that K (or Q or KDB+) would be any better for this
>>>> > > > application than J. The grass is always greener on the other side of the
>>>> > > > fence, especially after you've crossed it?
>>>> > > >
>>>> > > > Also, if I do my job properly, the language itself becomes irrelevant and
>>>> > > > the data structures are straightforward enough to allow any arbitrary
>>>> > > > language to be used.
>>>> > > >
>>>> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
>>>> > > >
>>>> > > > Thanks,
>>>> > > >
>>>> > > > --
>>>> > > > Raul
>>>> > > >
>>>> > > >
>>>> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
>>>> > > >
>>>> > > > > I think I would pay for k's database capability.  --Kip Murray
>>>> > > > >
>>>> > > > > Sent from my iPad
>>>> > > > >
>>>> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
>>>> > > > > >
>>>> > > > > > I would take a look at the mapped file database lab to get ideas.
>>>> > > > > >
>>>> > > > > > -
>>>> > > > > > Björn Helgason
>>>> > > > > > gsm:6985532
>>>> > > > > > skype:gosiminn
>>>> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
>>>> > > > > >>
>>>> > > > > >> I have thought about using symbols, but the only way to delete symbols that
>>>> > > > > >> I know of involves exiting J. And, my starting premise was that I would
>>>> > > > > >> have too much data to fit into memory.
>>>> > > > > >>
>>>> > > > > >> For some computations it does make sense to start up an independent J
>>>> > > > > >> session for each part of the calculation (and, in fact, that is what I am
>>>> > > > > >> doing in a different aspect of dealing with this dataset - it's about 10
>>>> > > > > >> terabytes, or so I am told - I've not actually seen it all yet and it takes
>>>> > > > > >> time to upload it). But for some calculations you need to be able to
>>>> > > > > >> correlate between pieces which have been dealt with elsewhere.
>>>> > > > > >>
>>>> > > > > >> I have similar reservations about fixed-width fields. There's just too much
>>>> > > > > >> data for me to predict how wide the fields are going to be. In some cases I
>>>> > > > > >> might actually be going with fixed-width, but that might be too inefficient
>>>> > > > > >> for the general case. I've one field which would have to be over 100k in
>>>> > > > > >> width if it was fixed width, even though typical cases are shorter than 1k.
>>>> > > > > >> At some point I might go with fixed width, and I expect that doing so will
>>>> > > > > >> cause me to lose a few records which will be discovered later in
>>>> > > > > >> processing. That might not be a big deal, for this large of a data set, but
>>>> > > > > >> if it's not necessary why bother?
>>>> > > > > >>
>>>> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a good
>>>> > > > > >> idea, at least for the character data. But that is an optimization and
>>>> > > > > >> optimizations speed up some operations at the expense of slowing down other
>>>> > > > > >> operations. So what really matters is the workload.
>>>> > > > > >>
>>>> > > > > >> Ultimately, for a dataset this large, it's going to take time.
>>>> > > > > >>
>>>> > > > > >> Thanks,
>>>> > > > > >>
>>>> > > > > >> --
>>>> > > > > >> Raul
>>>> > > > > >>
>>>> > > > > >>
>>>> > > > > >>
>>>> > > > > >>
>>>> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>>>> > > > > >>>
>>>> > > > > >>> It seems this representation is somewhat similar to how the symbol table
>>>> > > > > >>> stores strings:
>>>> > > > > >>>
>>>> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
>>>> > > > > >>>
>>>> > > > > >>> Also, did you consider using symbols? I've used symbols for string columns
>>>> > > > > >>> that contain highly repetitive data, for example, an invoice table with an
>>>> > > > > >>> alpha-numeric SKU.
>>>> > > > > >>>
>>>> > > > > >>> Thanks for sharing
>>>> > > > > >>>
>>>> > > > > >>>
>>>> > > > > >>>
>>>> > > > > >>>
>>>> > > > > >>>
>>>> > > > > >>>
>>>> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
>>>> > > > > >>>
>>>> > > > > >>>> Consider this example:
>>>> > > > > >>>>
>>>> > > > > >>>> table=:<;._2;._2]0 :0
>>>> > > > > >>>> First Name,Last Name,Sum,
>>>> > > > > >>>> Adam,Wallace,19,
>>>> > > > > >>>> Travis,Smith,10,
>>>> > > > > >>>> Donald,Barnell,8,
>>>> > > > > >>>> Gary,Wallace,27,
>>>> > > > > >>>> James,Smith,10,
>>>> > > > > >>>> Sam,Johnson,10,
>>>> > > > > >>>> Travis,Neal,11,
>>>> > > > > >>>> Adam,Campbell,11,
>>>> > > > > >>>> Walter,Abbott,13,
>>>> > > > > >>>> )
>>>> > > > > >>>>
>>>> > > > > >>>> Using boxed strings works great for relatively small sets of data. But when
>>>> > > > > >>>> things get big, their overhead starts to hurt too much.  (Big means: so much
>>>> > > > > >>>> data that you'll probably not be able to fit it all in memory at the same
>>>> > > > > >>>> time. So you need to plan on relatively frequent delays while reading from
>>>> > > > > >>>> disk.)
>>>> > > > > >>>>
>>>> > > > > >>>> One alternative to boxed strings is segmented strings. A segmented string
>>>> > > > > >>>> is an argument which could be passed to <;._1. It's basically just a string
>>>> > > > > >>>> with a prefix delimiter. You can work with these sorts of strings directly,
>>>> > > > > >>>> and achieve results similar to what you would achieve with boxed arrays.
>>>> > > > > >>>>
>>>> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
>>>> > > > > >>>> the integrity checks, so if you mess up you probably will not see an error.
>>>> > > > > >>>> So it's probably a good idea to model your code using boxed arrays on a
>>>> > > > > >>>> small set of data and then convert to segmented representation once you're
>>>> > > > > >>>> happy with how things work (and once you see a time cost that makes it
>>>> > > > > >>>> worth spending the time to rework your code).
>>>> > > > > >>>>
>>>> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
>>>> > > > > >>>> do an initial pass on the data, to extract its structure.
>>>> > > > > >>>>
>>>> > > > > >>>> Here's an example:
>>>> > > > > >>>>
>>>> > > > > >>>> FirstName=:;LF&,each }.0{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>> LastName=:;LF&,each }.1{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>> Sum=:;LF&,each }.2{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>> > > > > >>>>
>>>> > > > > >>>> FirstNameDir=: ssdir FirstName
>>>> > > > > >>>> LastNameDir=: ssdir LastName
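>>>> > > > > >>>>
>>>> > > > > >>>> A small worked case may make the directory layout clearer. This is only
>>>> > > > > >>>> a sketch on a made-up three-name string (s and dir are names assumed
>>>> > > > > >>>> here for illustration). The first row of the result holds the offset of
>>>> > > > > >>>> each delimiter and the second row holds each segment's length,
>>>> > > > > >>>> delimiter included:
>>>> > > > > >>>>
>>>> > > > > >>>> ```j
>>>> > > > > >>>> NB. ssdir as defined above, repeated so the sketch stands alone
>>>> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>> > > > > >>>>
>>>> > > > > >>>> NB. made-up sample: three LF-prefixed names, 16 characters in all
>>>> > > > > >>>> s=: LF,'Adam',LF,'Travis',LF,'Sam'
>>>> > > > > >>>>
>>>> > > > > >>>> dir=: ssdir s
>>>> > > > > >>>> $ dir       NB. 2 3
>>>> > > > > >>>> {. dir      NB. 0 5 12  (offsets of the delimiters)
>>>> > > > > >>>> {: dir      NB. 5 7 4   (lengths, counting the delimiter)
>>>> > > > > >>>> ```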
>>>> > > > > >>>>
>>>> > > > > >>>> Actually, sum is numeric so let's just use a numeric representation for
>>>> > > > > >>>> that column
>>>> > > > > >>>>
>>>> > > > > >>>> Sum=: _&".@> }.2{"1 table
>>>> > > > > >>>>
>>>> > > > > >>>> Which rows have a last name of Smith?
>>>> > > > > >>>>
>>>> > > > > >>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
>>>> > > > > >>>>
>>>> > > > > >>>> 1 4
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> Actually, there's an assumption there that Smith is not part of some larger
>>>> > > > > >>>> name. We can include the delimiter in the search if we are concerned about
>>>> > > > > >>>> that. For even more protection we could append a trailing delimiter on our
>>>> > > > > >>>> segmented string and then search for (in this case) LF,'Smith',LF.
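>>>> > > > > >>>>
>>>> > > > > >>>> A hedged sketch of that delimited search, on a made-up column where one
>>>> > > > > >>>> name merely starts with Smith (names, dir and hits are illustrative
>>>> > > > > >>>> names, not part of the original example). Because each hit now starts
>>>> > > > > >>>> exactly on a delimiter, a plain lookup in the offset row gives the row
>>>> > > > > >>>> numbers:
>>>> > > > > >>>>
>>>> > > > > >>>> ```j
>>>> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>>>> > > > > >>>>
>>>> > > > > >>>> NB. made-up sample: rows 0 and 3 are Smith, row 1 is Smithson
>>>> > > > > >>>> names=: LF,'Smith',LF,'Smithson',LF,'Neal',LF,'Smith'
>>>> > > > > >>>> dir=: ssdir names
>>>> > > > > >>>>
>>>> > > > > >>>> NB. the bare search also reports Smithson (row 1)
>>>> > > > > >>>> <:({.dir) I. I.'Smith' E. names       NB. 0 1 3
>>>> > > > > >>>>
>>>> > > > > >>>> NB. delimited search against a copy with a trailing LF appended
>>>> > > > > >>>> hits=: I. (LF,'Smith',LF) E. names,LF
>>>> > > > > >>>> ({.dir) i. hits                       NB. 0 3
>>>> > > > > >>>> ```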
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> Anyways, let's extract the corresponding sums and first names:
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>>   1 4{Sum
>>>> > > > > >>>>
>>>> > > > > >>>> 10 10
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> Travis
>>>> > > > > >>>>
>>>> > > > > >>>> James
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
>>>> > > > > >>>> though, if what you are extracting is a small part of the total. And, in
>>>> > > > > >>>> that case, using a list of indices to express a boolean result seems like a
>>>> > > > > >>>> good thing. You wind up working with set operations (intersection and
>>>> > > > > >>>> union) rather than logical operations (and and or). Also, set difference
>>>> > > > > >>>> instead of logical not (dyadic -. instead of monadic -.).
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> intersect=: [ -. -.
>>>> > > > > >>>>
>>>> > > > > >>>> union=: ~.@,
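>>>> > > > > >>>>
>>>> > > > > >>>> As a rough usage sketch, here are those verbs applied to row-index lists
>>>> > > > > >>>> taken from the example table (smith and sum10 are names assumed for
>>>> > > > > >>>> illustration; the Sum values are copied from the table above):
>>>> > > > > >>>>
>>>> > > > > >>>> ```j
>>>> > > > > >>>> intersect=: [ -. -.
>>>> > > > > >>>> union=: ~.@,
>>>> > > > > >>>>
>>>> > > > > >>>> Sum=: 19 10 8 27 10 10 11 11 13   NB. Sum column of the example table
>>>> > > > > >>>> smith=: 1 4                       NB. rows with last name Smith
>>>> > > > > >>>> sum10=: I. 10 = Sum               NB. 1 4 5
>>>> > > > > >>>>
>>>> > > > > >>>> smith intersect sum10             NB. 1 4    (Smith and Sum=10)
>>>> > > > > >>>> smith union I. 27 = Sum           NB. 1 4 3  (Smith or Sum=27)
>>>> > > > > >>>> sum10 -. smith                    NB. 5      (Sum=10 but not Smith)
>>>> > > > > >>>> ```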
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
>>>> > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> Thanks,
>>>> > > > > >>>>
>>>> > > > > >>>>
>>>> > > > > >>>> --
>>>> > > > > >>>>
>>>> > > > > >>>> Raul
>>>> > > > > >>>>
>>>> > > > > >>>> ----------------------------------------------------------------------
>>>> > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
