Re: [ngram] Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

There was one thing I wanted to mention about huge-count.pl. When you give
it a list of files as input, it treats those files as one single big file.
So if your goal is to maintain file boundaries (that is, not let bigrams
cross file boundaries, while still letting them cross newlines within a
single file), then this will be a problem. I am a little perplexed at the
moment about how to solve it. One thought was that you could convert each
of your input files so that the entire file is on a single line. Then you
could input those file names and use the --newline option, so that bigrams
do not cross line boundaries (which would now be file boundaries).
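
As a rough, untested sketch of that conversion (assuming the input files
are *.txt in the current directory, and using a new directory called
oneline/ just for illustration):

mkdir -p oneline
for f in *.txt; do
    # Replace every newline with a space so the whole file becomes one line,
    # then end it with a single newline so each file is exactly one record.
    tr '\n' ' ' < "$f" > "oneline/$f"
    echo >> "oneline/$f"
done

The files in oneline/ could then be given as input, with --newline keeping
bigrams from spanning what are now file boundaries.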

I will keep thinking. The suggestion to use xargs is a nice one; it will
get past the "Argument list too long" error, although there is still the
issue of respecting file boundaries when counting bigrams (which the idea
above is intended to help with...).

Good luck, and keep us posted.
Ted



Re: [ngram] Using huge-count.pl with lots of files

2018-04-16 Thread Serge Sharoff s.shar...@leeds.ac.uk [ngram]
With a really large number of files one can use find and xargs:

find . -name '*.txt' | xargs cat
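
For example (an untested sketch; the output filename is illustrative), the
result can be redirected into a single file that is then given to
huge-count.pl, with -print0/-0 handling filenames that contain spaces:

find . -name '*.txt' -print0 | xargs -0 cat > all_files.txt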

Serge


From: ngram@yahoogroups.com  on behalf of Ted Pedersen 
tpede...@d.umn.edu [ngram] 
Sent: 15 April 2018 23:41:36
To: ngram@yahoogroups.com
Subject: Re: [ngram] Using huge-count.pl with lots of files



I guess my first thought would be to see if there is a simple way to compute 
the input you are providing to huge count into fewer files. If you have a lot 
of files that start with the letter 'a', for example, you could concatentate 
them all together via a (Linux) command like

cat a* > myafiles.txt

and then use myafiles.txt as an input to huge_count.

This is just one idea, but it's a start perhaps. If this isn't helpful please 
let us know and we can try again!

On Sun, Apr 15, 2018 at 1:19 PM, 
catherine.dejage...@gmail.com<mailto:catherine.dejage...@gmail.com> [ngram] 
mailto:ng...@yahoogroups..com>> wrote:


I am trying to get the bigram counts aggregated across a lot of files. However, 
when I ran huge-count.pl<http://huge-count.pl> using the list of files as an 
input, I got the error "Argument list too long". What would you recommend for 
combining many files, when there are too many files to just run 
huge-count.pl<http://huge-count.pl> as is?


Thank you,

Catherine







Re: [ngram] Using huge-count.pl with lots of files

2018-04-15 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
I guess my first thought would be to see if there is a simple way to
combine the input you are providing to huge-count.pl into fewer files. If
you have a lot of files that start with the letter 'a', for example, you
could concatenate them all together via a (Linux) command like

cat a* > myafiles.txt

and then use myafiles.txt as an input to huge-count.pl.

This is just one idea, but it's a start perhaps. If this isn't helpful
please let us know and we can try again!
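
A rough, untested sketch of what that workflow might look like (the output
directory name is illustrative, and the exact arguments should be checked
against the huge-count.pl documentation):

cat a* > myafiles.txt
# hypothetical invocation: destination directory first, then the input file
huge-count.pl outputdir myafiles.txt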



[ngram] Using huge-count.pl with lots of files

2018-04-15 Thread catherine.dejage...@gmail.com [ngram]
I am trying to get the bigram counts aggregated across a lot of files. However, 
when I ran huge-count.pl using the list of files as an input, I got the error 
"Argument list too long". What would you recommend for combining many files, 
when there are too many files to just run huge-count.pl as is?
 

Thank you,
Catherine