hi mark & john

I wrote a script to parallelize the extract (step 5)
   scripts/generic/extract-parallel.perl
I haven't integrated it into train-model.perl yet due to lack of time.

Step 6 can also be parallelized, but you need to split the data a little more carefully. Attached is the script that is my start on this.

I'm not sure which script John is referring to, but it would be good to know.

On 05/12/2011 23:09, John D Burger wrote:
Mark Fishel wrote:

the "--parallel" switch of the train-model.perl script is only
effective during the first 2 steps -- is there a good reason not to
make the phrase scoring (step 6) parallel? Currently it contains a
'for my $direction ("f2e","e2f")...', and on a large corpus the
scoring can take quite long -- so shouldn't it be straightforward and
also useful to run the two in parallel?

In fact, this step could be parallelized far more than that pretty easily.  The
wrapper script already splits the phrase table up into many smaller files, and
then runs the phrase scorer executable on them sequentially.

- John Burger
   MITRE



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
#! /usr/bin/perl -w

# example
#  ./score-parallel.perl 8 "./coreutils-8.9/src/sort --batch-size=253" ./score \
#      extract lex phrase.table.half ...

use strict;
use File::Basename;

sub NumStr($);

print "Started ".localtime() ."\n";

my $numParallel    = $ARGV[0];
my $sortCmd            = $ARGV[1];
my $scoreCmd    = $ARGV[2];

my $extract = $ARGV[3]; # 1st arg
my $lex = $ARGV[4]; # 2nd arg
my $ptHalf = $ARGV[5]; # 3rd

my $otherArgs    = "";
for (my $i = 6; $i < $#ARGV + 1; ++$i)
{
  $otherArgs .= $ARGV[$i] ." ";
}

my $TMPDIR=dirname($extract)  ."/tmp.$$";
mkdir $TMPDIR or die "Cannot create $TMPDIR: $!";

my $totalLines = int(`zcat $extract.gz | wc -l`);
my $linesPerSplit = int($totalLines / $numParallel) + 1;

print "total=$totalLines line-per-split=$linesPerSplit \n";

my $numExtract    = splitNotOn1stColumn("$extract.gz", $linesPerSplit, $TMPDIR, "extract");
my $numExtractInv = splitNotOn1stColumn("$extract.inv.gz", $linesPerSplit, $TMPDIR, "extract.inv");

# run extract

#$cmd = "rm -rf $TMPDIR \n";
#print $cmd;
#`$cmd`;

print "Finished ".localtime() ."\n";


sub NumStr($)
{
  # zero-pad to at least five digits, e.g. 7 -> "00007"
  my $i = shift;
  return sprintf("%05d", $i);
}

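# Split a gzipped extract file into chunks of roughly $numLines lines, never
# breaking inside a run of lines that share the same first column.
# Arguments: gzipped extract file, lines per split, temp dir, output file prefix.
sub splitNotOn1stColumn
{
  my ($extract, $numLines, $tmpDir, $prefix) = @_;

  my $numSplits = 1;
  my $do = 1;
  my $lineCount = 0;

  my $filePath = "$tmpDir/$prefix." .NumStr($numSplits);
  open OUTFILE, ">", $filePath or die "Argg $filePath";
  open(INFILE, "gunzip -c $extract |") || die "Eeeee $extract";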
 
  my $prevSource;
  while (my $line = <INFILE>)
  {
    if ($lineCount < $numLines)
    { # still under line count. do nothing
    }
    elsif ($lineCount == $numLines)
    { # reached the line count: remember this source, start thinking about splitting
      $prevSource = get1stColumn($line);
    }
    else
    { # should we split ?
      my $thisSource = get1stColumn($line);
      if ($prevSource eq $thisSource)
      { # same source, can't split
        # do nothing
      }
      else
      { # split
        close OUTFILE;
       
        ++$numSplits;
        $filePath = "$tmpDir/$prefix." .NumStr($numSplits);
        open OUTFILE, ">", $filePath or die "Argg $filePath";
       
        $lineCount = 0;
      }
    }

    print OUTFILE $line;

    ++$lineCount;
  }

  close INFILE;
  close OUTFILE;

  # the caller stores this as the number of split files written
  return $numSplits;
}

sub get1stColumn($)
{
  my $line = shift;
  my @toks = split(/[|][|][|]/, $line);
  my $source = $toks[0];

  return $source;
}
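
The script above stops at the "# run extract" placeholder, so the split files are never actually scored. What follows is only a rough sketch of how that step might look, not part of the original attachment. runParallel and buildScoreCmds are hypothetical helpers, the scorer's argument order (extract, lex, output) is assumed from the example in the header comment, and the output file name phrase-table.half.NNNNN is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one scorer command per split file.
# ASSUMPTION: the scorer takes "extract lex output [other args]", as in the
# example at the top of the script; the output name is hypothetical.
sub buildScoreCmds
{
  my ($scoreCmd, $lex, $tmpDir, $numSplits, $otherArgs) = @_;
  my @cmds;
  for my $i (1 .. $numSplits) {
    my $n = sprintf("%05d", $i);    # same zero padding as NumStr
    push @cmds, "$scoreCmd $tmpDir/extract.$n $lex $tmpDir/phrase-table.half.$n $otherArgs";
  }
  return @cmds;
}

# Run each command in its own child process, with at most $maxJobs children
# alive at once. Returns the number of commands that exited non-zero.
sub runParallel
{
  my ($maxJobs, @cmds) = @_;
  my $running = 0;
  my $failed  = 0;

  foreach my $cmd (@cmds) {
    if ($running >= $maxJobs) {
      # pool is full: block until any one child finishes
      wait();
      ++$failed if $? != 0;
      --$running;
    }
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
      # child: replace this process with the command
      exec($cmd) or die "exec failed: $!";
    }
    ++$running;
  }

  # wait for the children that are still running
  while ($running > 0) {
    wait();
    ++$failed if $? != 0;
    --$running;
  }
  return $failed;
}
```

In the script this might be called as runParallel($numParallel, buildScoreCmds($scoreCmd, $lex, $TMPDIR, $numExtract, $otherArgs)), followed by concatenating the per-split outputs in order. Because each split ends on a first-column boundary, the pieces can be scored independently.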
