Re: [Jprogramming] Vector Similarity

2018-02-21 Thread 'Bo Jacoby' via Programming
Thank you, Skip! I tried to publish an Ordinal Fraction article in Wikipedia, but it was removed because original research is not allowed in Wikipedia. However, somebody copied it into this link: StateMaster - Encyclopedia: Ordinal fraction. I think that the most obvious use of ordinal fractions

Re: [Jprogramming] Vector Similarity

2018-02-21 Thread Raul Miller
Skip, Are you sure you have the word2vec description right? https://en.wikipedia.org/wiki/Word2vec claims the dimensionality of word2vec is typically in the range of 100 ... 1000, which would allow treatment of a rather limited vocabulary if each dimension corresponded to a distinct word. The i

[Jprogramming] File Cleanup

2018-02-21 Thread Skip Cave
I read in a text file of word vectors using fread. The format looks like this: bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ... bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ... belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ... bel

Re: [Jprogramming] File Cleanup

2018-02-21 Thread 'Jon Hough' via Programming
a=:'belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855' (}.~ #@:>@:{.@:;: ) a On Wed, 2/21/18, Skip Cave wrote: Subject: [Jprogramming] File Cleanup To: "[email protected]" Date: Wednesday, February 21, 2018, 5:36 PM

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Raul Miller
I think you want this pair of expressions: <@({.~ i.&' ');._2 text 0 1 }. _&".;._2 text (Note that giving ". a left argument tells J that you only want numeric results. When you do this, you don't need to do the search and replace, because J recognizes - as being a minus sign. Also, since
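
As a small sketch of how that pair of expressions behaves, here it is applied to a made-up two-line sample. The sample string and the names `words` and `vec` on the left of the assignments are assumptions for illustration; the real file is much larger and has more numbers per line.

```j
NB. Hypothetical sample standing in for the file contents; every line ends with LF.
text=: 'bell 0.0264 -0.2927', LF, 'belt 0.1332 0.0142', LF

NB. Cut on the trailing LF and box the characters before the first space of each line.
words=: <@({.~ i.&' ');._2 text

NB. Parse each line as numbers (the non-numeric label becomes _), then drop that first column.
vec=: 0 1 }. _&".;._2 text
```

Here `words` is a list of boxed labels and `vec` is a numeric table with one row per label.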

Re: [Jprogramming] File Cleanup

2018-02-21 Thread 'Mike Day' via Programming
txt here is a set of lines from your example with trailing ... removed; here are the first two:     ,.2{.txt +--+ |bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 | +--

Re: [Jprogramming] Vector Similarity

2018-02-21 Thread Skip Cave
Raul, Well, one way to start word2vec is to assign a boolean vector to each word, with a single boolean one in a different place for each unique word. That's why it's called 'one hot' embedding. However, after training the word set with a shallow, two-layer neural network, and doing significant

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Ric Sherlock
Another suggestion using some of J's in-built utilities dat=: freads 'yourfile.txt' labels=: <@(' '&taketo);._2 dat numbers=: _ ". (' '&takeafter);._2 dat HTH Ric On Wed, Feb 21, 2018 at 9:57 PM, 'Mike Day' via Programming < [email protected]> wrote: > txt here is a set of lines from y

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Ric Sherlock
Or using the tables/dsv addon: load 'tables/dsv' dat=: makenum ' ' readdsv 'yourfile.txt' Note that although they're boxed the numbers are actually numeric. To split them you could do: labels=: {."1 dat numbers=: > }."1 dat On Wed, Feb 21, 2018 at 11:03 PM, Ric Sherlock wrote: > Another sugge

Re: [Jprogramming] Vector Similarity

2018-02-21 Thread Raul Miller
Sure, you can represent words that way. That's basically the same kind of representation that you get encoding each word as a unique integer (index). Just less space efficient. So, for example, in J, if you had 8 words, you could represent the unique sequence of words as i.8 or you cou
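
A minimal sketch of the equivalence between integer indices and one-hot rows, for the 8-word case mentioned above. The names `ids` and `onehot` are made up for illustration.

```j
ids=: i. 8              NB. each word encoded as a unique integer index
onehot=: = ids          NB. self-classify gives an 8 by 8 boolean table, one 1 per row
back=: onehot i."1 ] 1  NB. position of the 1 in each row recovers the indices
```

The one-hot table holds 64 booleans where the index list holds 8 integers, which is the space cost referred to above.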

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Skip Cave
Thanks to Raul and Mike for the suggestions. I read in the data: nb =: <'C:\numberbatch-en.txt' nbs =. fread nb Then I tried to clean it up: Mike's method ran out of memory: nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs |out of memory When I tried to run it on a smaller set: nbs4=: (i.&' '

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Don Guinn
You need to convert words to a list. Also, might use &> instead of each, as it needs to be unboxed to use as an index. On Feb 21, 2018 9:09 AM, "Skip Cave" wrote: > Thanks to Raul and Mike for the suggestions. > > I read in the data: > > > nb =: <'C:\numberbatch-en.txt' > > nbs =. fread nb > > >

Re: [Jprogramming] File Cleanup

2018-02-21 Thread R.E. Boss
vec {~ (<'adults') i.~ words is perhaps what you are looking for R.E. Boss > -Original Message- > From: Programming [mailto:[email protected]] > On Behalf Of Skip Cave > Sent: woensdag 21 februari 2018 17:09 > To: [email protected] > Subject: Re: [Jprogra

Re: [Jprogramming] Vector Similarity

2018-02-21 Thread R.E. Boss
Why not submit it to arXiv.org? R.E. Boss > -Original Message- > From: Programming [mailto:[email protected]] > On Behalf Of 'Bo Jacoby' via Programming > Sent: woensdag 21 februari 2018 09:07 > To: [email protected] > Subject: Re: [Jprogramming] Vector Si

[Jprogramming] First obverse

2018-02-21 Thread David Lambert
For what it's worth, "obverse" essentially means "best approximation to inverse". I want to know if two points are the same, meaning one lies within a sufficiently small sphere about the other.  The correct way for my application would be to test if the Euclidean distance between them is suff
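
One way the Euclidean test described here might be sketched, assuming a hypothetical tolerance of 1e_9 and the made-up names `dist` and `near`. Note that `&.:` relies on the obverse of `*:` (square), which is `%:` (square root).

```j
dist=: +/&.:*:@:-     NB. square the differences, sum them, then undo the squaring
near=: 1e_9 > dist    NB. 1e_9 is a hypothetical tolerance, not a recommended value

0 0 dist 3 4
1 2 3 near 1 2 3
```

The right tolerance depends on the application; the sketch only shows the shape of the test.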

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Raul Miller
Yes, you need to box the name, when comparing it to the value of 'words' because the names in 'words' are all boxed. I'd do it like this: get=:3 :'vec{~words i. wrote: > Thanks to Raul and Mike for the suggestions. > > I read in the data: > > > nb =: <'C:\numberbatch-en.txt' > > nbs =. fread n

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Don Guinn
Defining a verb get to retrieve the index of the desired word as tacit does make get pretty much unreadable; however, there is a possible performance gain as the hash table for i. gets built only once when get is defined. If you will be running get many times this could result in a significant perf

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Raul Miller
That's an interesting point. That said, if you give get a large list of words to look up, it's the sort of issue which might be buried in everything else that's going on (the cost per word gets divided by the number of words being looked up at once). Thanks, -- Raul On Wed, Feb 21, 2018 at 12

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Don Guinn
How about defining get as get=:13 : '(,words)i.boxopen y' Then it can take a single word unboxed or a boxed list. On Wed, Feb 21, 2018 at 10:18 AM, Raul Miller wrote: > That's an interesting point. > > That said, if you give get a large list of words to look up, it's the > sort of issue whic
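
A sketch of a `boxopen`-based lookup that also fetches the vector rows, using made-up data. The contents of `words` and `vec` here are assumptions standing in for the parsed file.

```j
words=: 'bell';'bell_tower';'belt'       NB. hypothetical labels, a rank-1 boxed list
vec=: 3 2 $ 0.1 0.2 0.3 0.4 0.5 0.6      NB. hypothetical vectors, one row per label

NB. boxopen boxes a bare word and leaves an already boxed list alone.
get=: 3 : 'vec {~ words i. boxopen y'

get 'belt'
get 'bell';'belt'
```

A word that is not in `words` yields the index `#words`, so this version signals an index error rather than returning a row.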

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Raul Miller
Sure, that could work, but this definition only returns the word index, not the row(s) from vec. So it should get a different name. That said, you could also do something like (what I had been originally thinking of, though not the implementation I had posted): get=: 13 :'vec{~words i. ;:^:(0=

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Henry Rich
I don't think this prescription is accurate.  When m&i. is executed to create a fast search verb, the value of m is put into the new verb.  If m is a name, the value of the name is NOT copied, but instead referred to.  If the name m is subsequently reassigned, the old value is retained, referre

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Don Guinn
So words should be a list instead of a one column table. So we would have words&i. instead of (,words)&i. Correct? Doesn't the raveling prevent sharing of the contents of words in the new verb? And perhaps get should be get=:13 : 'words&i.boxopen y' instead of get=:13 : 'words i.boxopen y'

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Raul Miller
I suppose the shape of the variable 'words' depends on how you built the value for 'words'? If you used words=: <@({.~ i.&' ');._2 text the noun will be rank 1, and there's no need to ravel it. If you built it at rank 2, rather than rank 1... well... I'm not sure why you would want to do that

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Skip Cave
Wow! Thanks so much for all the help on cleaning up and parsing the numberbatch text file, as well as the various methods for extracting words and their associated vectors from the data. It will take me a bit to digest all this, as well as some time to test the various suggestions, to see which sch

Re: [Jprogramming] File Cleanup

2018-02-21 Thread Henry Rich
This is a good teaching example. Consider    words&i. string there are TWO executions.  First, (words&i.) is executed to produce an anonymous verb.  This execution of & does far more than just save the value of words.  It builds a hash table, saves the value of words, and does some other cla
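
The two executions can be separated by naming the intermediate verb, as in this sketch. The data and the name `lookup` are made up for illustration.

```j
words=: 'bell';'bell_tower';'belt'   NB. hypothetical labels
lookup=: words&i.                    NB. first execution: & creates the search verb once

lookup <'belt'                       NB. second execution: apply the verb to an argument
lookup 'bell';'nosuchword'           NB. a word that is absent yields #words
```

Reusing `lookup` for many queries avoids repeating the first execution each time.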

Re: [Jprogramming] File Cleanup

2018-02-21 Thread 'Mike Day' via Programming
I didn’t do a full check on my offering. I wonder, without being able to check easily right now, whether your “nbs” is a rectangular char array rather than a boxed array, like my example, “txt”. I vaguely recall that freadb returns lines as boxes, may be wrong. In any case, I did mean t

[Jprogramming] 807 beta

2018-02-21 Thread chris burke
The first 807 beta is available. To install, browse to code.jsoftware.com/wiki/System/Installation/Beta. This includes: * an updated regex * better support for the Qt WebEngine which allows GUI applications to be a mixture of the usual Qt desktop controls plus web browser controls * support for