There is no way to make huge-count.pl (or count.pl) case insensitive. It takes the input pretty much "as is" and uses that. So I think you'd need to lowercase your files before they reach huge-count.pl. You can use --token to specify how words are tokenized (for example, whether don't is treated as three tokens (don ' t) or one (don't)). --stop lets you exclude words from being counted, but there isn't anything that lets you ignore case.
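For the lowercasing step, here is a minimal sketch using the standard tr utility. The function name lowercase_copy and the .lc suffix are just illustrative choices on my part, not anything that ships with NSP.

```shell
# Write a lowercased copy of a text file next to the original.
# lowercase_copy and the ".lc" suffix are hypothetical names, not part of NSP.
lowercase_copy() {
    # $1 = input file; the lowercased copy is written to "$1.lc"
    tr '[:upper:]' '[:lower:]' < "$1" > "$1.lc"
}
```

You could call it in a loop, for example for f in corpus/*.txt; do lowercase_copy "$f"; done (corpus/ being a made-up directory name), and then give the .lc files to huge-count.pl. One caveat: some tr implementations only handle single-byte characters, so accented or other non-ASCII letters may not be folded.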
On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
> Hi Catherine,
>
> Here are a few answers to your questions, hopefully.
>
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
>
> The number of files that your system allows is pretty dependent on your
> system and operating system. On Linux you can actually adjust that (you
> may need sudo access) by running
>
> ulimit -s 100000
>
> or
>
> ulimit -s unlimited
>
> This raises the stack size limit, and on Linux the maximum total size of
> the command line arguments is derived from that limit, which can allow
> your system to handle more command line arguments. But if the new value
> is above your hard limit and you don't have sudo access this is not
> something you can do.
>
> As far as taking multiple outputs from huge-count.pl and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of huge-count.pl output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> huge-count.pl output before merging.
>
> The commands below kind of break down what is happening within
> huge-count.pl. If you run this you can get an idea of the input and
> output expected by each stage...
>
> count.pl --tokenlist input1.out input1
>
> count.pl --tokenlist input2.out input2
>
> huge-sort.pl --keep input1.out
>
> huge-sort.pl --keep input2.out
>
> mkdir output-directory
>
> mv input1.out-sorted output-directory
>
> mv input2.out-sorted output-directory
>
> huge-merge.pl --keep output-directory
>
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same.
> I'll go through your notes again and see if there are other issues to
> address...and of course if you try something and it does or doesn't
> work I'm very interested in hearing about that...
>
> Cordially,
> Ted
>
>
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately and so bigrams do not cross file
>> boundaries. Having verified that I'll get back to your original question..
>> Sorry about the diversion and the confusion that might have caused.
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>>> Let me go back and revisit this again, I seem to have confused myself!
>>>
>>> More soon,
>>> Ted
>>>
>>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram]
>>> <ngram@yahoogroups.com> wrote:
>>>
>>>>
>>>>
>>>> Did I misread the documentation then?
>>>>
>>>> "huge-count.pl doesn't consider bigrams at file boundaries. In other
>>>> words,
>>>> the result of count.pl and huge-count.pl on the same data file will
>>>> differ if --newLine is not used, in that, huge-count.pl runs count.pl
>>>> on multiple files separately and thus looses the track of the bigrams
>>>> on file boundaries. With --window not specified, there will be loss
>>>> of one bigram at each file boundary while its W bigrams with --window
>>>> W."
>>>>
>>>> I thought that means bigrams won't cross from one file to the next?
>>>>
>>>> If bigrams don't cross from one file to the next, then I just need to
>>>> run huge-count.pl on smaller inputs, then combine. So if I break
>>>> @filenames into smaller subsets, then call huge-count.pl on the
>>>> subsets, then call huge-merge.pl to combine the counts, I think that
>>>> should work.
>>>>
>>>> I have a few more questions related to usage:
>>>>
>>>>    - Do you know how many arguments are allowed for huge-count.pl?
>>>>    It would be good to know what size chunks I need to split my data
>>>>    into. Or if not, then how would I do a try catch block to catch the
>>>>    error "Argument list too long" from the
>>>>    IPC::System::Simple::system call?
>>>>    - Is there a case-insensitive way to count bigrams, or would I need
>>>>    to convert all the text to lowercase before calling huge-count.pl?
>>>>    - Would you consider modifying huge-count.pl so that the user can
>>>>    specify the final output filename, instead of just automatically calling
>>>>    the output file complete-huge-count.output?
>>>>
>>>> Thank you,
>>>> Catherine
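On removing the first line of huge-count.pl output before merging, as mentioned in the quoted message: a minimal sketch, assuming the total bigram count sits alone on line 1 and everything after it should be kept unchanged. The function name strip_total_line is made up for illustration and is not part of NSP.

```shell
# Copy a count file without its first line (the total bigram count).
# strip_total_line is a hypothetical helper name, not part of NSP.
strip_total_line() {
    # $1 = huge-count.pl-style output file, $2 = destination file
    tail -n +2 "$1" > "$2"
}
```

For example, strip_total_line part1.out part1.body (both file names invented here) writes everything except line 1 to part1.body. On the chunk size question, getconf ARG_MAX prints the system limit in bytes on the combined size of the arguments and environment, so the limit is a byte count rather than a fixed number of file names.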