Re: [ngram] Re: Using huge-count.pl with lots of files

Ted Pedersen tpede...@d.umn.edu [ngram] Tue, 17 Apr 2018 07:26:24 -0700

There is no way to make huge-count.pl (or count.pl) case insensitive. It
will take the input pretty much "as is" and use that. So, I think you'd
need to lowercase your files before they get to huge-count.pl. You can
use --token to specify how you tokenize words (for example, whether you
treat don't as three tokens (don ' t) or one (don't)). --stop lets you
exclude words from being counted, but there isn't anything that lets you
ignore case.
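
As a rough sketch of that lowercasing step (the function name and the
separate output directory are assumptions made for illustration, not
anything that ships with huge-count.pl), something like this could be run
first:

```shell
# Write a lowercased copy of each input file into a separate directory,
# leaving the originals untouched. The function name and directory layout
# are illustrative assumptions, not part of the Ngram Statistics Package.
lowercase_copies() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1
    for f in "$@"; do
        # tr maps single-byte upper-case letters to lower case; text with
        # multibyte characters (e.g. accented UTF-8) may need a
        # Unicode-aware tool instead.
        tr '[:upper:]' '[:lower:]' < "$f" > "$out_dir/$(basename "$f")" || return 1
    done
}

# Example with hypothetical file names:
#   lowercase_copies lowered input1 input2
```

The lowercased copies (lowered/input1 and so on in the example) would then
be the files handed to huge-count.pl.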


On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:

> Hi Catherine,
>
> Here are a few answers to your questions, hopefully.
>
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
>
> The number of files that your system allows is pretty dependent on your
> system and operating system. On Linux you can actually adjust that (if you
> have sudo access) by running
>
> ulimit -s 100000
>
> or
>
> ulimit -s unlimited
>
> This raises the stack size limit for the shell. On Linux the space
> available for command line arguments is tied to that limit (as far as I
> know it is a fraction of the stack size), so a larger value can let your
> system handle more file names on the command line. But if you aren't
> allowed to raise the limit (for example, above the hard limit without sudo
> access) this is not something you can do.
>
> As far as taking multiple outputs from huge-count.pl and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of huge-count.pl output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> huge-count.pl output before merging.
>
> The commands below roughly break down what is happening within
> huge-count.pl. If you run them you can get an idea of the input and output
> expected by each stage...
>
> count.pl --tokenlist input1.out input1
>
> count.pl --tokenlist input2.out input2
>
> huge-sort.pl --keep input1.out
>
> huge-sort.pl --keep input2.out
>
> mkdir output-directory
>
> mv input1.out-sorted output-directory
>
> mv input2.out-sorted output-directory
>
> huge-merge.pl --keep output-directory
>
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same. I'll go through your notes again and see if there are
> other issues to address...and of course if you try something and it does or
> doesn't work I'm very interested in hearing about that...
>
> Cordially,
> Ted
>
>
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately and so bigrams do not cross file
>> boundaries. Having verified that, I'll get back to your original question.
>> Sorry about the diversion and the confusion that might have caused.
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>>> Let me go back and revisit this again, I seem to have confused myself!
>>>
>>> More soon,
>>> Ted
>>>
>>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram]
>>> <ngram@yahoogroups.com> wrote:
>>>
>>>>
>>>>
>>>> Did I misread the documentation then?
>>>>
>>>> "huge-count.pl doesn't consider bigrams at file boundaries. In other
>>>> words,
>>>> the result of count.pl and huge-count.pl on the same data file will
>>>> differ if --newLine is not used, in that, huge-count.pl runs count.pl
>>>> on multiple files separately and thus looses the track of the bigrams
>>>> on file boundaries. With --window not specified, there will be loss
>>>> of one bigram at each file boundary while its W bigrams with --window
>>>> W."
>>>>
>>>> I thought that means bigrams won't cross from one file to the next?
>>>>
>>>> If bigrams don't cross from one file to the next, then I just need to
>>>> run huge-count.pl on smaller inputs, then combine. So if I break
>>>> @filenames into smaller subsets, then call huge-count.pl on the
>>>> subsets, then call huge-merge.pl to combine the counts, I think that
>>>> should work.
>>>>
>>>> I have a few more questions related to usage:
>>>>
>>>>    - Do you know how many arguments are allowed for huge-count.pl? It
>>>>    would be good to know what size chunks I need to split my data into.
>>>>    Or if not, then how would I do a try catch block to catch the error
>>>>    "Argument list too long" from the IPC::System::Simple::system call?
>>>>    - Is there a case-insensitive way to count bigrams, or would I need
>>>>    to convert all the text to lowercase before calling huge-count.pl?
>>>>    - Would you consider modifying huge-count.pl so that the user can
>>>>    specify the final output filename, instead of just automatically calling
>>>>    the output file complete-huge-count.output?
>>>>
>>>> Thank you,
>>>> Catherine
>>>>
>>>> 
>>>>
>>>
>>>
>>
>
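
On the huge-merge.pl point in the quoted reply (removing the first line of
huge-count.pl output before merging), a minimal sketch could look like the
following. The function name and file names are hypothetical, and it
assumes the total bigram count really is the only thing on the first line:

```shell
# Copy a count file to a new file without its first line (the total
# bigram count). Function name and arguments are illustrative assumptions.
strip_total_line() {
    in_file=$1
    out_file=$2
    # tail -n +2 prints everything from the second line to the end.
    tail -n +2 "$in_file" > "$out_file"
}

# Example with hypothetical file names:
#   strip_total_line chunk1.output chunk1.nototal
```

Writing to a new file rather than editing in place keeps the original
output, including its total, available in case it is needed later.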

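For the question about splitting the file names into smaller subsets, one
possible approach is to keep the names in a list file, one per line, and
cut that list into fixed-size chunks. This is only a sketch: the function
name, the list file, and the chunk size are assumptions, and a safe chunk
size depends on the system's argument limit (which can be inspected with
getconf ARG_MAX) and on how long the file names are.

```shell
# Split a list of file names (one per line) into chunk lists holding at
# most chunk_size names each. Chunks are written as PREFIX.aa, PREFIX.ab,
# and so on. Function name and arguments are illustrative assumptions.
split_file_list() {
    list_file=$1
    chunk_size=$2
    prefix=$3
    # split -l cuts the input into pieces of at most chunk_size lines.
    split -l "$chunk_size" "$list_file" "$prefix."
}

# Example with hypothetical names:
#   getconf ARG_MAX                    # upper bound on argument bytes
#   split_file_list filenames.txt 1000 chunk
```

Each chunk list can then drive its own huge-count.pl run, with the
argument order taken from the huge-count.pl documentation.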