: FWIW: I used the script below to build myself 3.8 million documents, with 
: 300 "text fields" consisting of anywhere from 1-10 "words" (integers 
: between 0 and 199)

Whoops ... forgot to post the script...


#!/usr/bin/perl

use strict;
use warnings;

my $num_docs = 3_800_000;
my $max_words_in_field = 10;
my $words_in_vocab = 200;
my $num_fields = 300;

# header
print "id";
print ",${_}_t" for 1..$num_fields; # one "<n>_t" column per text field
print "\n";

while ($num_docs--) {
    print "$num_docs"; # uniqueKey
    for (1..$num_fields) {
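        # int(rand(N)) yields 0..N-1; the 0..$words_in_field range below
        # therefore emits 1..$max_words_in_field words per field
        my $words_in_field = int(rand($max_words_in_field));
        print ",\"";
        # each "word" is a random integer in 0..$words_in_vocab-1
        print int(rand($words_in_vocab)) . " " for 0..$words_in_field;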
        print "\"";
    }
    print "\n";
}

