[ngram] yahoo groups going away - ngram - Ngram Statistic Package

2019-10-21 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
As you may have heard, Yahoo Groups is going away in a few weeks. This is
what we have been using (for more than 15 years now) for the NSP (Ngram
Statistics Package) mailing list (ngram).

https://help.yahoo.com/kb/SLN31010.html

Over the years I've been archiving the ngram mailing list to mail-archive,
so previous content is available there (going back many years now).

https://www.mail-archive.com/ngram@yahoogroups.com/

The email list is not too active these days, so I am planning to use the
more general DuluthNLP email list as a place to post updates about NSP and
where you can post any questions. Folks continue to use NSP so we
will continue to answer questions as they arise. Please feel free to join
up if you would like to stay in touch.

https://groups.google.com/forum/#!forum/duluthnlp

The NSP project page remains at:

http://ngram.sourceforge.net/

Thanks for your interest in NSP over the years, and please do stay in
touch.

Cordially,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse


[ngram] Re: Some questions about Text-NSP

2018-12-06 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
My apologies for being a bit slow in following up on this. But, I
think for identifying significant or interesting bigrams with Fisher's
exact test, a left sided test makes the most sense. The left sided
test gives us the probability that the pair of words would occur
together no more often than we observed if we repeated our experiment
on another sample of text (assuming the words are independent). If the
left sided probability is high, it means our current observation is
much more frequent than we'd expect based on pure chance alone, and so
the pair of words we have observed is likely to be significant or
interesting.
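
To make that concrete, here is a rough illustration with made-up counts (not
from NSP itself), assuming SciPy is available. Both tables have the same
marginal totals (word1 appears 500 times, word2 appears 1000 times, out of
100,000 bigrams), so the expected joint count is 5 in each case.

from scipy.stats import fisher_exact

# Made-up 2x2 tables: [[n11, n12], [n21, n22]]; expected n11 is 5 in both.
at_expectation = [[5, 495], [995, 98505]]      # observed n11 right at expectation
above_expectation = [[15, 485], [985, 98515]]  # observed n11 well above expectation

for name, table in [("at expectation", at_expectation),
                    ("above expectation", above_expectation)]:
    _, p_left = fisher_exact(table, alternative="less")
    # p_left is P(n11 <= observed) under independence: middling (around 0.6)
    # for a chance-level count, and close to 1 when the pair co-occurs more
    # often than independence would predict.
    print(name, round(p_left, 4))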

I hope this makes some sense, but please feel free to follow up if it
doesn't or if you think I may be misinterpreting something here.

Cordially,
Ted

---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 6:28 PM Ted Pedersen  wrote:
>
> Thanks for these questions - all of the details are quite helpful. And
> yes, I think your method for computing n12 and n22 is just fine.
>
> As a historical note, it's worth pointing out the Fishing for
> Exactness paper pre-dates Text-NSP by a number of years. This paper
> was published in 1996, and Text-NSP began in about 2002 and was actively
> developed for several years thereafter. That said, when implementing
> Text-NSP we were certainly basing it off of this earlier work and so
> I'd hope the results from Text-NSP would be consistent with the paper.
> To that end I ran the example you gave on Text-NSP and show the
> results below. What you see is consistent with what you ran in python,
> and so it seems pretty clear that the results from the paper are
> indeed the two tailed test (contrary to what the paper says).
>
> cat x.cnt
> 1382828
> and<>industry<>22 30707 952
>
> statistic.pl leftFisher x.left x.cnt
>
> cat x.left
> 1382828
> and<>industry<>1 0.6297 22 30707 952
>
> statistic.pl rightFisher x.right x.cnt
>
> cat x.right
> 1382828
> and<>industry<>1 0.4546 22 30707 952
>
> statistic.pl twotailed x.two x.cnt
>
> cat x.two
> 1382828
> and<>industry<>1 0.8253 22 30707 952
>
> As to your more general question of what should be done, I will need
> to refresh my recollection of this, although in general the
> interpretation of left, right and two sided tests depends on your null
> hypothesis. In our case, and for finding "dependent" bigrams in
> general, the null hypothesis is that the two words are independent,
> and so we are seeking evidence to either confirm or deny that
> hypothesis. The left sided test (for Fisher's exact) is giving us the
> probability of observing n11 <= 22 under that hypothesis. How to interpret
> that is where I need to refresh
> my recollection, but that is the general direction things are heading.
>
> I think a one sided test makes more sense for identifying dependent
> bigrams, since in general if you have more occurrences than you expect
> by chance, at some point beyond that expected value you are going to
> decide it's not a chance occurrence. There is no value above the
> expected value where you are going to say (I don't think) oh no, these
> two words are no longer dependent on each other (ie they are occurring
> too frequently to be dependent). I think a two tailed test makes the
> most sense if there is a point both above and below the expected value
> where your null hypothesis is potentially rejected.
>
> In the case of "and industry" where the expected value is 21.14, it
> seems very hard to argue that 22 occurrences is enough to say that
> they are dependent. But, this is where I'm just a little foggy right
> now. I'll look at this a little more and reply a bit more precisely.
>
> I'm not sure about the keyword extraction case, but if you have an
> example I'd be happy to think a little further about that as well!
>
> More soon,
> Ted
> ---
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
> On Sun, Nov 25, 2018 at 11:32 AM BLK
> Serene  wrote:
> >
> > Thanks for the clarification!
> >
> > And I have some other question about your paper "Fishing for Exactness"
> >
> > 1. The paper says that "In the test for association to determine bigram 
> > dependence Fisher's exact test is interpreted as a left-sided test."
> > And in last part "Experiment: Test for Association", it also says that "In 
> > this experiment, we compare the significance values computed using the 
> > t-test, the x2 approximation to the distribution of both G2 and X2 and 
> > Fisher's exact test (left sided)".
> > But as for the examples given in "Figure 8: test for association:  
> > industry":
> > E.g. for word "and", the given data is:
> > n++ (total number of tokens in the corpus): 1382828 (taken from "Figure 
> > 3")
> > n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
> >
> > n11 = 22
> > n21 = 952 - 22 = 930
> >
> > Since n12 is not given in the table, I have to compute it by
> > m11 = n1+ * n+1 / n++
> > so n1+ is 21.14 * 1382828 / 952 = 30706.915882352943 (approximately 
> > 30707)
> >
> > And then:
> > n12 = 30707 - 22 = 30685
> > n22 = 1382828 - 952 

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Thanks for these questions - all of the details are quite helpful. And
yes, I think your method for computing n12 and n22 is just fine.

As a historical note, it's worth pointing out the Fishing for
Exactness paper pre-dates Text-NSP by a number of years. This paper
was published in 1996, and Text-NSP began in about 2002 and was actively
developed for several years thereafter. That said, when implementing
Text-NSP we were certainly basing it off of this earlier work and so
I'd hope the results from Text-NSP would be consistent with the paper.
To that end I ran the example you gave on Text-NSP and show the
results below. What you see is consistent with what you ran in python,
and so it seems pretty clear that the results from the paper are
indeed the two tailed test (contrary to what the paper says).

cat x.cnt
1382828
and<>industry<>22 30707 952

statistic.pl leftFisher x.left x.cnt

cat x.left
1382828
and<>industry<>1 0.6297 22 30707 952

statistic.pl rightFisher x.right x.cnt

cat x.right
1382828
and<>industry<>1 0.4546 22 30707 952

statistic.pl twotailed x.two x.cnt

cat x.two
1382828
and<>industry<>1 0.8253 22 30707 952
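
For anyone who wants to re-check these numbers, here is a small sketch
(assuming SciPy is installed) that builds the 2x2 contingency table from the
x.cnt data above (the three numbers after the bigram are n11, n1p and np1,
and the first line of the file is the total bigram count) and reproduces the
three p-values.

from scipy.stats import fisher_exact

npp = 1382828                   # first line of x.cnt: total number of bigrams
n11, n1p, np1 = 22, 30707, 952  # from "and<>industry<>22 30707 952"

n12 = n1p - n11                 # word1 followed by something other than word2
n21 = np1 - n11                 # word2 preceded by something other than word1
n22 = npp - n1p - np1 + n11     # neither word in its position
table = [[n11, n12], [n21, n22]]

for alt in ("less", "greater", "two-sided"):  # left, right, two-tailed
    _, p = fisher_exact(table, alternative=alt)
    print(alt, round(p, 4))     # should come out near 0.6297, 0.4546, 0.8253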

As to your more general question of what should be done, I will need
to refresh my recollection of this, although in general the
interpretation of left, right and two sided tests depends on your null
hypothesis. In our case, and for finding "dependent" bigrams in
general, the null hypothesis is that the two words are independent,
and so we are seeking evidence to either confirm or deny that
hypothesis. The left sided test (for Fisher's exact) is giving us the
probability of observing n11 <= 22 under that hypothesis. How to interpret
that is where I need to refresh
my recollection, but that is the general direction things are heading.

I think a one sided test makes more sense for identifying dependent
bigrams, since in general if you have more occurrences than you expect
by chance, at some point beyond that expected value you are going to
decide it's not a chance occurrence. There is no value above the
expected value where you are going to say (I don't think) oh no, these
two words are no longer dependent on each other (ie they are occurring
too frequently to be dependent). I think a two tailed test makes the
most sense if there is a point both above and below the expected value
where your null hypothesis is potentially rejected.

In the case of "and industry" where the expected value is 21.14, it
seems very hard to argue that 22 occurrences is enough to say that
they are dependent. But, this is where I'm just a little foggy right
now. I'll look at this a little more and reply a bit more precisely.

I'm not sure about the keyword extraction case, but if you have an
example I'd be happy to think a little further about that as well!

More soon,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 11:32 AM BLK
Serene  wrote:
>
> Thanks for the clarification!
>
> And I have some other question about your paper "Fishing for Exactness"
>
> 1. The paper says that "In the test for association to determine bigram 
> dependence Fisher's exact test is interpreted as a left-sided test."
> And in last part "Experiment: Test for Association", it also says that "In 
> this experiment, we compare the significance values computed using the 
> t-test, the x2 approximation to the distribution of both G2 and X2 and 
> Fisher's exact test (left sided)".
> But as for the examples given in "Figure 8: test for association:  
> industry":
> E.g. for word "and", the given data is:
> n++ (total number of tokens in the corpus): 1382828 (taken from "Figure 
> 3")
> n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
>
> n11 = 22
> n21 = 952 - 22 = 930
>
> Since n12 is not given in the table, I have to compute it by
> m11 = n1+ * n+1 / n++
> so n1+ is 21.14 * 1382828 / 952 = 30706.915882352943 (approximately 30707)
>
> And then:
> n12 = 30707 - 22 = 30685
> n22 = 1382828 - 952 - 30707 + 22 = 1351191
>
> I'm not sure if my calculation is correct, but when using n11 = 22, n12 = 
> 30685, n21 = 930, n22 = 1351191 as the input, the left-sided fisher's exact 
> test gives the result 0.6296644386744733 which is not matched with 0.8255 
> given in the example. I use Python's Scipy module to calculate this:
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative='less')
> # the parameter "alternative" specifies that the left-sided test be used
> (1.041670459980972, 0.6296644386744733)
> # The first value is the Odds Ratio (irrelevant), the second is the p-value given by Fisher's exact test
>
> Then I tried the two-tailed test, which gave the expected value
> (approximately):
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative='two-sided')  # Two-sided test
> (1.041670459980972, 0.8253462481347)
>
> So I suppose that the results given in the figure are actually calculated
> using the two-sided Fisher's exact test (is it a mistake or 

[ngram] Re: Some questions about Text-NSP

2018-11-25 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Blk,

Thanks for pointing these out. On the Poisson Stirling measure, I
think the reason we haven't included log n is that log n would simply
be a constant (log of the total number of bigrams) and so would not
change the rankings that we get from these scores. That said, if you
were comparing scores across different sized corpora then the
denominator would likely be important to include.
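
To make the constant-denominator point concrete, here is a tiny sketch with
made-up counts showing that dividing every score by log(n) rescales the
Poisson-Stirling values but does not change how the bigrams rank.

from math import log

n = 1_000_000        # total number of bigrams; a constant for a given corpus
bigrams = {          # bigram -> (observed n11, expected m11); made-up values
    "strong_pair": (50, 2.0),
    "weak_pair":   (10, 8.0),
    "chance_pair": (5, 5.0),
}

def ps(n11, m11):
    # Poisson-Stirling as given in the Text-NSP documentation
    return n11 * (log(n11) - log(m11) - 1)

ranked_plain   = sorted(bigrams, key=lambda b: ps(*bigrams[b]), reverse=True)
ranked_divided = sorted(bigrams, key=lambda b: ps(*bigrams[b]) / log(n), reverse=True)
print(ranked_plain == ranked_divided)  # True: the ranking is unchanged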

Thanks for pointing out the typos. Text-NSP is right now in a fairly
dormant state, but I do have a list of small changes to make and will
add yours to these.

Thanks for your interest, and please let us know if you have any other
questions.

Cordially,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 4:13 AM BLK Serene  wrote:
>
> Hi, I have some questions about the association measures implemented in 
> Text-NSP:
>
> The Poisson-Stirling Measure given in the documentation is:
> Poisson-Stirling = n11 * ( log(n11) - log(m11) - 1)
>
> But in Quasthoff's paper the formulae given by the author is:
> sig(A, B) = (k * (log k - log λ - 1)) / log n
>
> I'm a little confused since I know little about math or statistics. Why is 
> the denominator omitted here?
>
> And some typos in the doc:
> square of phi coefficient:
> PHI^2 = ((n11 * n22) - (n21 * n21))^2/(n1p * np1 * np2 * n2p)
> where n21 *n21 should be n12 * n21
>
> chi-squared test:
> Pearson's Chi-squred test measures the devitation (should be deviation) 
> between
>
> Pearson's Chi-Squared = 2 * [((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>  ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2]
> should be: ((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2
>
> And chi2: same as above.
>
> Thanks in advance.


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
There is not a way to make huge-count.pl (or count.pl) case insensitive. It
will take the input pretty much "as is" and use that. So, I think you'd
need to lower case your files before they make it to huge-count.pl. You can
use --token to specify how you tokenize words (for example, whether you treat
don't as three tokens (don ' t) or as one (don't)). --stop lets you exclude
words from being counted, but there isn't anything that lets you ignore case.
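
If it helps, here is a minimal sketch of that lowercasing step; the directory
names are only placeholders for however your files are actually laid out.

from pathlib import Path

src = Path("corpus")        # original files (hypothetical location)
dst = Path("corpus-lower")  # lowercased copies to feed to huge-count.pl
dst.mkdir(exist_ok=True)

for f in src.iterdir():
    if f.is_file():
        # write a lowercased copy; huge-count.pl is then run on corpus-lower
        (dst / f.name).write_text(f.read_text().lower())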

On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen  wrote:

> Hi Catherine,
>
> Here are a few answers to your questions, hopefully.
>
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
>
> The number of files that your system allows is pretty dependent on your
> system and operating system. On Linux you can actually adjust that (if you
> have sudo access) by running
>
> ulimit -s 10
>
> or
>
> ulimit -s unlimited
>
> This raises the stack size limit, and on Linux the maximum total length of
> command line arguments is tied to that limit (roughly a quarter of the
> stack size), so a larger stack can allow your system to handle longer
> argument lists. But if you don't have sudo access this is not
> something you can do.
>
> As far as taking multiple outputs from huge-count.pl and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of huge-count.pl output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> huge-count.pl output before merging.
>
> The commands below kind of break down what is happening within
> huge-count.pl. If you run this you can get an idea of the input output
> expected by each stage...
>
> count.pl --tokenlist input1.out input1
>
> count.pl --tokenlist input2.out input2
>
> huge-sort.pl --keep input1.out
>
> huge-sort.pl --keep input2.out
>
> mkdir output-directory
>
> mv input1.out-sorted output-directory
>
> mv input2.out-sorted output-directory
>
> huge-merge.pl --keep output-directory
>
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same. I'll go through your notes again and see if there are
> other issues to address...and of course if you try something and it does or
> doesn't work I'm very interested in hearing about that...
>
> Cordially,
> Ted
>
>
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen  wrote:
>
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately and so bigrams do not cross file
>> boundaries. Having verified that, I'll get back to your original question.
>> Sorry about the diversion and the confusion it might have caused.
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen  wrote:
>>
>>> Let me go back and revisit this again, I seem to have confused myself!
>>>
>>> More soon,
>>> Ted
>>>
>>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram]
>>>  wrote:
>>>


 Did I misread the documentation then?

 "huge-count.pl doesn't consider bigrams at file boundaries. In other
 words,
 the result of count.pl and huge-count.pl on the same data file will
 differ if --newLine is not used, in that, huge-count.pl runs count.pl
 on multiple files separately and thus looses the track of the bigrams
 on file boundaries. With --window not specified, there will be loss
 of one bigram at each file boundary while its W bigrams with --window
 W."

 I thought that means bigrams won't cross from one file to the next?

 If bigrams don't cross from one file to the next, then I just need to
 run huge-count.pl on smaller inputs, then combine. So if I break
 @filenames into smaller subsets, then call huge-count.pl on the
 subsets, then call huge-merge.pl to combine the counts, I think that
 should work.

 I have a few more questions related to usage:

 - Do you know how many arguments are allowed for huge-count.pl? It
 would be good to know what size chunks I need to split my data into.
 Or if not, then how would I do a try catch block to catch the error
 "Argument list too long" from the IPC::System::Simple::system call?
 - Is there a case-insensitive way to count bigrams, or would I need
 to convert all the text to lowercase before calling huge-count.pl?
 - Would you consider modifying huge-count.pl so that the user can
 specify the final output filename, instead of just automatically calling
 the output file complete-huge-count.output?

 Thank you,
 Catherine

 

>>>
>>>
>>
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

Here are a few answers to your questions, hopefully.

I don't think we'll be able to update this code anytime soon - we just
don't have anyone available to work on that right now, unfortunately. That
said we are very open to others making contributions, fixes, etc.

The number of files that your system allows is pretty dependent on your
system and operating system. On Linux you can actually adjust that (if you
have sudo access) by running

ulimit -s 10

or

ulimit -s unlimited

This raises the stack size limit, and on Linux the maximum total length of
command line arguments is tied to that limit (roughly a quarter of the
stack size), so a larger stack can allow your system to handle longer
argument lists. But if you don't have sudo access this is not
something you can do.
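
If it is useful, here is a rough sketch of how one might estimate a safe
chunk size directly from the system's argument-length limit rather than
waiting for the error; this assumes a POSIX system, and the directory name
and safety margin are just guesses.

import os

filenames = sorted(os.listdir("corpus"))            # hypothetical input directory
arg_max = os.sysconf("SC_ARG_MAX")                  # max bytes for argv plus environment
avg_len = sum(len(f) + 1 for f in filenames) / len(filenames)
chunk_size = max(1, int((arg_max // 2) / avg_len))  # keep well under the limit

chunks = [filenames[i:i + chunk_size]
          for i in range(0, len(filenames), chunk_size)]
print(len(chunks), "chunks of up to", chunk_size, "files each")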

As far as taking multiple outputs from huge-count.pl and merging them with
huge-merge, I think the answer is that's almost possible, but not quite.
huge-merge is not expecting the bigram count that appears on the first line
of huge-count.pl output to be there, and seems to fail as a result. So you
would need to remove that first line from your huge-count.pl output before
merging.
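
Here is a small sketch of that clean-up step (the file names are
hypothetical); all it does is drop the first line, which holds the total
bigram count, before the file goes into the directory that huge-merge.pl
will read.

from pathlib import Path

src = Path("complete-huge-count.output")         # output of one huge-count.pl run
dst = Path("output-directory") / "part1-sorted"  # hypothetical name in the merge directory

lines = src.read_text().splitlines(keepends=True)
dst.parent.mkdir(exist_ok=True)
dst.write_text("".join(lines[1:]))               # everything except the first (count) line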

The commands below kind of break down what is happening within huge-count.pl.
If you run this you can get an idea of the input output expected by each
stage...

count.pl --tokenlist input1.out input1

count.pl --tokenlist input2.out input2

huge-sort.pl --keep input1.out

huge-sort.pl --keep input2.out

mkdir output-directory

mv input1.out-sorted output-directory

mv input2.out-sorted output-directory

huge-merge.pl --keep output-directory

I hope this helps. I realize it's not exactly a solution, but I hope it's
helpful all the same. I'll go through your notes again and see if there are
other issues to address...and of course if you try something and it does or
doesn't work I'm very interested in hearing about that...

Cordially,
Ted


On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen  wrote:

> The good news is that our documentation is more reliable than my memory.
> :) huge-count treats each file separately and so bigrams do not cross file
> boundaries. Having verified that, I'll get back to your original question.
> Sorry about the diversion and the confusion it might have caused.
>
> More soon,
> Ted
>
> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen  wrote:
>
>> Let me go back and revisit this again, I seem to have confused myself!
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <
>> ngram@yahoogroups.com> wrote:
>>
>>>
>>>
>>> Did I misread the documentation then?
>>>
>>> "huge-count.pl doesn't consider bigrams at file boundaries. In other
>>> words,
>>> the result of count.pl and huge-count.pl on the same data file will
>>> differ if --newLine is not used, in that, huge-count.pl runs count.pl
>>> on multiple files separately and thus looses the track of the bigrams
>>> on file boundaries. With --window not specified, there will be loss
>>> of one bigram at each file boundary while its W bigrams with --window W.."
>>>
>>> I thought that means bigrams won't cross from one file to the next?
>>>
>>> If bigrams don't cross from one file to the next, then I just need to
>>> run huge-count.pl on smaller inputs, then combine. So if I break
>>> @filenames into smaller subsets, then call huge-count.pl on the
>>> subsets, then call huge-merge.pl to combine the counts, I think that
>>> should work.
>>>
>>> I have a few more questions related to usage:
>>>
>>> - Do you know how many arguments are allowed for huge-count.pl? It
>>> would be good to know what size chunks I need to split my data into.
>>> Or if not, then how would I do a try catch block to catch the error
>>> "Argument list too long" from the IPC::System::Simple::system call?
>>> - Is there a case-insensitive way to count bigrams, or would I need to
>>> convert all the text to lowercase before calling huge-count.pl?
>>> - Would you consider modifying huge-count.pl so that the user can
>>> specify the final output filename, instead of just automatically calling
>>> the output file complete-huge-count.output?
>>>
>>> Thank you,
>>> Catherine
>>>
>>> 
>>>
>>
>>
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Let me go back and revisit this again, I seem to have confused myself!

More soon,
Ted

On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Did I misread the documentation then?
>
> "huge-count.pl doesn't consider bigrams at file boundaries. In other
> words,
> the result of count.pl and huge-count.pl on the same data file will
> differ if --newLine is not used, in that, huge-count.pl runs count.pl
> on multiple files separately and thus looses the track of the bigrams
> on file boundaries. With --window not specified, there will be loss
> of one bigram at each file boundary while its W bigrams with --window W."
>
> I thought that means bigrams won't cross from one file to the next?
>
> If bigrams don't cross from one file to the next, then I just need to run
> huge-count.pl on smaller inputs, then combine. So if I break @filenames
> into smaller subsets, then call huge-count.pl on the subsets, then call
> huge-merge.pl to combine the counts, I think that should work.
>
> I have a few more questions related to usage:
>
>    - Do you know how many arguments are allowed for huge-count.pl? It
>    would be good to know what size chunks I need to split my data into.
>    Or if not, then how would I do a try catch block to catch the error
>    "Argument list too long" from the IPC::System::Simple::system call?
>    - Is there a case-insensitive way to count bigrams, or would I need to
>    convert all the text to lowercase before calling huge-count.pl?
>    - Would you consider modifying huge-count.pl so that the user can
>    specify the final output filename, instead of just automatically calling
>    the output file complete-huge-count.output?
>
> Thank you,
> Catherine
>
> 
>


Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

Just to make sure I'm understanding what you'd like to do, could you send
the command you are trying to run, and some idea of the number of files
you'd like to process?

Thanks!
Ted

On Sun, Apr 15, 2018 at 6:01 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> That makes sense, but I'm not sure it will give me the behavior I want. I
> don't want bigrams to span from one file to the next, but I do want them to
> span across newlines. If I concatenate the files, then as I understand it
> my first condition is no longer met. Could I run huge-count.pl on
> subgroups of files, then combine the results? And how would I do that?
> 
>


Re: [ngram] Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I guess my first thought would be to see if there is a simple way to
combine the input you are providing to huge-count into fewer files. If you
have a lot of files that start with the letter 'a', for example, you could
concatenate them all together via a (Linux) command like

cat a* > myafiles.txt

and then use myafiles.txt as an input to huge_count.

This is just one idea, but it's a start perhaps. If this isn't helpful
please let us know and we can try again!

On Sun, Apr 15, 2018 at 1:19 PM, catherine.dejage...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> I am trying to get the bigram counts aggregated across a lot of files.
> However, when I ran huge-count.pl using the list of files as an input, I
> got the error "Argument list too long". What would you recommend for
> combining many files, when there are too many files to just run
> huge-count.pl as is?
>
>
> Thank you,
>
> Catherine
>
>
> 
>


[ngram] Re: PMI Query

2017-05-14 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Julio,

Thanks for your question. In NSP we are always counting ngrams, so the
order of the words making up the ngram is considered. When we are counting
bigrams (the default case for NSP), word1 is always the first word in a
bigram, and word2 is always the second word. I think in other presentations
of PMI word1 and word2 are simply co-occurrences, so the order does not
matter. However, for NSP order does matter, and so n1p is the number of
times word1 occurs as the first word in an ngram.

Here's a very simple example where cat occurs as the first word in a bigram
3 times and as the second word in a bigram 1 time. Note that I've used the
--newline option so that ngrams do not extend across lines.

ukko(14): cat test
cat mouse
cat mouse
cat mouse
house cat
ukko(15): count.pl --newline test.cnt test
ukko(16): cat test.cnt
4
cat<>mouse<>3 3 3
house<>cat<>1 1 1
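
If it helps to see the arithmetic spelled out, here is a small sketch (not
NSP code itself) that computes PMI from the counts in test.cnt above, using
the ordered counts n11, n1p and np1; the log base is a convention, so the
exact number statistic.pl reports may use a different base.

from math import log2

npp = 4                  # first line of test.cnt: total number of bigrams
n11, n1p, np1 = 3, 3, 3  # from "cat<>mouse<>3 3 3"

m11 = n1p * np1 / npp    # expected count if the two words were independent
pmi = log2(n11 / m11)    # base 2 here; NSP's choice of base may differ
print(round(pmi, 4))     # 3 / 2.25 gives log2(1.333...) which is about 0.415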

This is described in more detail in the NSP paper (see below), which would
be a reasonable reference, I think. I hope this helps, and please let us
know if other questions arise.

Cordially,
Ted

The Design, Implementation, and Use of the Ngram Statistics Package
(Banerjee and
Pedersen) - Appears in the Proceedings of the Fourth International
Conference on Intelligent Text Processing and Computational Linguistics,
pp. 370-381, February 17-21, 2003, Mexico City.



On Sun, May 14, 2017 at 12:10 AM, Julio Santisteban 
wrote:

> Hi Ted & Satanjeev ,
>
> I am Julio from Peru and I have a small query. In your Perl implementation
> of PMI you mention about the contingency table: "n1p is the number of
> times word1 occurs as the first word in a bigram". But this is not how it
> is usually done; PMI is usually worked out with n1p as the marginal (total
> frequency of word1) from the contingency table.
>
> I am sure you are correct, I just want to ask you some reference about it.
>
> http://search.cpan.org/~tpederse/Text-NSP-1.31/lib/Text/NSP/Measures/2D/MI/pmi.pm
>
>              word2   ~word2
>    word1      n11      n12   | n1p
>   ~word1      n21      n22   | n2p
>              -----------------
>               np1      np2     npp
>
>
> Regards,
> Julio Santisteban
>


Re: [ngram] Upload files

2017-04-01 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I think this mail was somehow delayed, but I hope this response is still
useful.

NSP has a  command line interface. In general you specify the output file
first, and the input file second. So if you want to write the output of
count.pl to a file called myoutput.txt, and if your input text is
myinput.txt, you could submit the following command.

count.pl myoutput.txt myinput.txt

Here's an example 

ted@ted-HP-Z210-CMT-Workstation ~ $ cat myinput.txt
hi this is ted speaking how are you today!
I am well.
Today is April 1.

ted@ted-HP-Z210-CMT-Workstation ~ $ count.pl myoutput.txt myinput.txt

ted@ted-HP-Z210-CMT-Workstation ~ $ cat myoutput.txt
18
you<>today<>1 1 1
well<>.<>1 1 2
how<>are<>1 1 1
today<>!<>1 1 1
am<>well<>1 1 1
is<>ted<>1 2 1
I<>am<>1 1 1
is<>April<>1 2 1
ted<>speaking<>1 1 1
.<>Today<>1 1 1
speaking<>how<>1 1 1
hi<>this<>1 1 1
this<>is<>1 1 2
April<>1<>1 1 1
1<>.<>1 1 2
Today<>is<>1 1 2
are<>you<>1 1 1
!<>I<>1 1 1

I hope this helps!
Ted

On Tue, Jan 31, 2017 at 9:54 AM, rocioc...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Hello Ted,
>
> Thank you very much for your message, but I still don't know how I can
> take a file as input :( this is being a huge challenge for me, I hope you
> can still give some help with that.
>
> Thanks again, and sorry to disturb you.
> Rocío
> 
>


Re: [ngram] Upload files

2017-01-31 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Text::NSP has a command line interface that allows you to provide a file or
a folder/directory for input. There are some simple examples shown below
that take a single file as input. That might be a good place to start, just
to make sure everything is working as expected.

http://search.cpan.org/dist/Text-NSP/doc/USAGE.pod

Please let us know as questions arise!

Good luck,
Ted

On Tue, Jan 31, 2017 at 7:52 AM, rocioc...@gmail.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> Dear colleages,
>
> I am new in this. I've just installed Perl and Text-NSP, but I have no
> idea how I can supply files to work with. Should I put them in a specific
> folder?
>
> Thanks in advance,
> Rocío
>
> 
>


Re: [ngram] Ignoring regex with no delimiters

2016-05-12 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The regex in token should look like this:

/\S+/

I think not having the / / is causing the delimiter errors...
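
For completeness, here is a minimal sketch of that fix, assuming count.pl is
on your PATH and the Documents directory from the command above exists.

from pathlib import Path
import subprocess

# The token definition must be a Perl regex wrapped in slashes, one per line.
Path("token").write_text("/\\S+/\n")

subprocess.run(
    ["count.pl", "--ngram=1", "--token=token", "ocount.txt", "Documents"],
    check=True,
)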

On Thu, May 12, 2016 at 2:11 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:

>
>
> I'm running count.pl on a set of unicode documents. I created a new
> file ('token') which contains '\S+' in order to match any characters but
> space.
>
> Here is the output:
>
>
> ⇒  count.pl --ngram=1 --token=token ocount.txt Documents
>
> Ignoring regex with no delimiters: \S+
>
> No token definitions to work with.
>
> Type count.pl --help for help.
>
>
> What's the problem?!
>
> 
>


Re: [ngram] count.pl for unicode documents

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Tokenization and the --token option are described here :

http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens

On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] <
ngram@yahoogroups.com> wrote:

>
> [Attachment(s) from amir.jad...@yahoo.com included below]
>
> I'm trying to run count.pl for a directory of unicode documents (a sample
> document has been attached) using Perl 5 (v5.18.2). The output is a list
> of digits and punctuation marks without any unicode words:
>
> 2732
>
> .<>1589
>
> :<>626
>
> 2<>19
>
> !<>17
>
> 10<>16
>
> 4<>14
>
> 13<>13
>
> 12<>13
>
> 20<>12
>
> 9<>11
>
> 15<>11
>
> 3<>10
>
> 5<>10
>
> Is it possible to ask count.pl to tokenize the input file just by space?
>
> There is a --token option which may be useful, but I don't know how to use it.
>
> 
>


Re: [ngram] How to recognize informative n-grams in a corpus?

2016-05-10 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
The Ngram Statistics Package is mostly intended to help you find the most
frequent ngrams in a corpus, or the most strongly associated ngrams in a
corpus. It doesn't necessarily directly give you informativeness, although
you can certainly come up with ways to use frequency and measures of
association to find that. It sounds like you should look at our paper on
NSP to get some ideas about how to use it, and what it offers.

http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

Also, the code itself has some documentation that should be helpful...

http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod

http://search.cpan.org/~tpederse/Text-NSP/doc/USAGE.pod

I hope this helps!
Ted

On Tue, May 10, 2016 at 5:22 AM, 'Amir H. Jadidinejad' amir.jad...@yahoo.com
[ngram]  wrote:

>
>
> Hi,
>
> I have a corpus of 3K short text documents. I’m going to *recognize the
> most informative n-grams* in the corpus.
> Unfortunately, I can’t find a straight way from the documents. Would you
> please help me?
>
> Kind regards,
> Amir H. Jadidinejad
>
> 
>


[ngram] the (apparent) demise of search.cpan.org

2014-07-18 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
For many years now, http://search.cpan.org has been my go-to link for
finding CPAN distributions, and has been the URL we've listed on our web
sites directing users to Perl software downloads.

Sadly the site has become very unreliable in the last few months, and there
does not appear to be a solution in the works. So, I've decided to
gradually migrate to using https://metacpan.org as our default web site
for finding and pointing at CPAN distributions.

This will involve making changes on web pages and in documentation, and it
will take a while to do. But it seems important, since the search site can
create the impression that CPAN is down. It's not. CPAN is alive
and well; it's just that one particular navigator is not working too well.

I hope to make these changes on the main package pages fairly soon, but in
the event you run into a 503 or 504 error when accessing the search site,
please realize there are other ways, and that CPAN is just fine.

Here's some additional commentary and info about this issue

https://github.com/perlorg/perlweb/issues/115
http://perlhacks.com/2013/01/give-me-metacpan/
http://www.perlmonks.org/index.pl?node_id=1093542
http://grokbase.com/t/perl/beginners/145nsxqz2w/cpan-unavailable

When we started using the search site in about 2002 it was pretty great.
The good news is that https://metacpan.org is even better, so this is a
positive change.

Thanks,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse