Note on i: Solr replication provides good clustering support out of the box, including replication of multiple cores. Read the wiki page on replication (Google +solr +replication if you don't know where it is).
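For orientation, the master/slave setup described there comes down to a pair of requestHandler entries in solrconfig.xml. A rough sketch only (the host name, poll interval, and conf file list below are placeholders; check the wiki for the details for your version):

On the master:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

On each slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://YOURMASTER:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>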
In my experience, the problem with indexing PDFs is that it takes a lot of CPU on the document-parsing side (the client), not on the Solr server side. So make sure you do that part on the client and not on the server.

Avoiding iii: I suggest you write yourself a multi-threaded performance test so that you aren't guessing what your performance will be. We wrote one in Perl. It handles an individual thread (we were testing inquiry), and we wrote a little batch file / shell script to start up the desired number of threads. Here is the main statement in our batch file (the rest just sets the variables); a shell script would be even easier (see the sketch after the Perl program):

    for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h %SOLRHOST% -c %COUNT% -u %1 -p %2 -r %SOLRREALM% -f %SOLRLOC%\firstsynonyms.txt -l %SOLRLOC%\lastsynonyms.txt -z %FUZZ%

The Perl:

#!/usr/bin/perl
#
# Perl program to run a thread of solr testing
#
use Getopt::Std;            # For options processing
use POSIX;                  # For time formatting
use XML::Simple;            # For processing of XML config file
use Data::Dumper;           # For debugging XML config file
use HTTP::Request::Common;  # For HTTP request to Solr
use HTTP::Response;
use LWP::UserAgent;         # For HTTP request to Solr

$host = "YOURHOST:8983";
$realm = "YOUR AUTHENTICATION REALM";
$firstlist = "firstsynonyms.txt";
$lastlist = "lastsynonyms.txt";
$fuzzy = "";
$me = $0;

sub usage()
{
    print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
    print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
    exit(8);
}

#
# Process the command line options.
#
getopts('dc:u:p:f:l:h:r:z:') || usage();
if(!$opt_c) { usage(); }
$count = $opt_c;
if($opt_u) { $user = $opt_u; }
if($opt_p) { $password = $opt_p; }
if($opt_h) { $host = $opt_h; }
if($opt_f) { $firstlist = $opt_f; }
if($opt_l) { $lastlist = $opt_l; }
if($opt_r) { $realm = $opt_r; }
if($opt_z) { $fuzzy = "~" . $opt_z; }
$debug = $opt_d;

#
# If the host string does not include a :, add :80
#
if($host !~ /:/) {
    $host = $host . ":80";
}

#
# Read the lists of first and last names
#
open(SYNFILE,"<$firstlist") || die "Can't open first name list $firstlist\n";
while(<SYNFILE>) {
    @newwords = split /,/;
    for($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @firstnames, @newwords;
}
close(SYNFILE);

open(SYNFILE,"<$lastlist") || die "Can't open last name list $lastlist\n";
while(<SYNFILE>) {
    @newwords = split /,/;
    for($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @lastnames, @newwords;
}
close(SYNFILE);

print scalar(@firstnames) . " First Names, " . scalar(@lastnames) . " Last Names\n";
print "User: $user\n";

my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
$userAgent->credentials("$host", $realm, $user, $password);
$uri = "http://$host/solr/select";

$starttime = time();
for($c = 0; $c < $count; ++$c) {
    # Pick a random first and last name for this query
    $fname = $firstnames[rand @firstnames];
    $lname = $lastnames[rand @lastnames];
    $response = $userAgent->request(POST $uri,
        [ q    => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy",
          rows => "25" ]);
    if($debug) {
        print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy\n";
        print $response->content();
    }
    print "POST for $fname $lname completed, HTTP status=" . $response->code . "\n";
}
$elapsed = time() - $starttime;
$average = $elapsed / $count;
print "Time: $elapsed s ($average s/request)\n";
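And since a shell script was mentioned above: an untested sh equivalent of the batch driver might look like the following, with the same placeholder variables as the batch file ($1 and $2 are still user and password):

#!/bin/sh
# Start $THREADS copies of solrtest.pl in the background, then wait for them all
i=1
while [ "$i" -le "$THREADS" ]; do
    perl solrtest.pl -h "$SOLRHOST" -c "$COUNT" -u "$1" -p "$2" -r "$SOLRREALM" \
        -f "$SOLRLOC/firstsynonyms.txt" -l "$SOLRLOC/lastsynonyms.txt" -z "$FUZZ" &
    i=`expr $i + 1`
done
wait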
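Back on the PDF point: the extraction program will depend on your toolkit, but the shape of it is "parse the PDF on the client, post plain text to Solr". A minimal sketch, assuming pdftotext is installed on the client and your schema has id and text fields (the host and field names here are placeholders, not from a real installation):

#!/usr/bin/perl
#
# Sketch: extract PDF text on the client, then post plain text to Solr
#
use HTTP::Request;
use LWP::UserAgent;

my ($id, $pdf) = @ARGV;

# The heavy lifting (PDF parsing) happens here, on the client
my $text = `pdftotext "$pdf" -`;
die "pdftotext failed for $pdf\n" if $?;

# Escape the XML special characters for the update message
for ($id, $text) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }

my $xml = "<add><doc>"
        . "<field name=\"id\">$id</field>"       # placeholder field names;
        . "<field name=\"text\">$text</field>"   # adjust to your schema
        . "</doc></add>";

my $ua  = LWP::UserAgent->new(agent => 'pdfpost.pl');
my $uri = "http://YOURHOST:8983/solr/update";

# In real life you would batch many docs per <add> and commit far less often
for my $body ($xml, "<commit/>") {
    my $req = HTTP::Request->new(POST => $uri);
    $req->content_type('text/xml; charset=utf-8');
    $req->content($body);
    my $resp = $ua->request($req);
    print "POST status=" . $resp->code . "\n";
}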
"\n"; } $elapsed = time() - $starttime; $average = $elapsed / $count; print "Time: $elapsed s ($average/request)\n"; -----Original Message----- From: Rode Gonzalez (libnova) [mailto:r...@libnova.es] Sent: Saturday, August 13, 2011 3:50 AM To: solr-user@lucene.apache.org Subject: ideas for indexing large amount of pdf docs Hi all, I want to ask about the best way to implement a solution for indexing a large amount of pdf documents between 10-60 MB each one. 100 to 1000 users connected simultaneously. I actually have 1 core of solr 3.3.0 and it works fine for a few number of pdf docs but I'm afraid about the moment when we enter in production time. some possibilities: i. clustering. I have no experience in this, so it will be a bad idea to venture into this. ii. multicore solution. make some kind of hash to choose one core at each query (exact queries) and thus reduce the size of the individual indexes to consult or to consult all the cores at same time (complex queries). iii. do nothing more and wait for the catastrophe in the response times :P Someone with experience can help a bit to decide? Thanks a lot in advance.