Hi All,

I currently use Tesseract from a Perl module via the import
"use Image::OCR::Tesseract 'get_ocr';". The module does web scraping,
specifically of data from the website www.hoteltravel.com. Tesseract
comes into play when extracting rates from the rate breakup (the per-day
price breakdown of a room). The website stores its room pricing data in
a PNG image, and Tesseract is used to extract the text from that image.

The issue I'm facing is that rates and currency symbols are misread when
the image is converted to text with 'get_ocr'. The sample code below
shows how I'm using Tesseract and how I clean up the extracted rates.
Specifics of the issue follow the sample code.

$agent->save_content("/tmp/rate_img.png"); # Save the content to a temporary file

my $rate_img = get_ocr('/tmp/rate_img.png'); # Convert image to text

unlink '/tmp/rate_img.png'; # Delete the temporary file
my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is; # Extract rate from text

##############################
#Perform Cleanup of Extracted Rate
##############################

# Fallback patterns, tried in order until one captures a rate
($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!(?:U\d{2}|\$)\s*([\d\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
$rate_per_day =~ s!E!8!isg;  # OCR reads "8" as "E"
$rate_per_day =~ s!L!4.!isg; # OCR reads "4." as "L"
$rate_per_day =~ s!\,!!sg;   # Remove thousands separators
$rate_img = &make_ascii_text($rate_img, 'utf-8');

$currency = 'EUR' if ($rate_img =~ m!ae?!is);
$currency = 'THB' if ($rate_img =~ m!as!is);
$currency = 'USD' if ($rate_img =~ m!USS!is);
$currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
$currency = 'GBP' if ($rate_img =~ m!a?!is); 
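
The cleanup above can be sketched as a single helper that applies the look-alike substitutions only to the amount and then checks that the result is a plain number. This is a minimal sketch, not a definitive fix: the name normalize_rate is hypothetical, the prefix pattern assumes a "US$" prefix read as "U55" or similar, and the "O" to "0" mapping is an assumption based on the "U5511E.OO" sample.

```perl
use strict;
use warnings;

# Hypothetical helper: repair digit look-alikes in the amount part of an
# OCR'd rate string such as "U55 7E9.75".
sub normalize_rate {
    my ($ocr_text) = @_;

    # Match the mangled "US$" prefix ("S" and "$" are often read as "5"),
    # then capture the token that should be the amount.
    my ($amount) = $ocr_text =~ /U[S5][\$5S]\s*([0-9A-Za-z.,]+)/
        or return;

    $amount =~ s/,//g;       # drop thousands separators
    $amount =~ s/L/4./g;     # "4." read as "L" (as in the post)
    $amount =~ tr/EOo/800/;  # "8" read as "E"; "0" read as "O" (assumption)

    # Accept the result only if it now looks like a plain decimal number.
    return $amount =~ /\A\d+(?:\.\d+)?\z/ ? $amount : undef;
}

print normalize_rate('U55 EL15'), "\n";
```

Substitution of this kind cannot repair a digit misread as another digit (for example "96.82" read as "96.52"), because the wrong value is still a valid number.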

S. No | Criteria                                                             | Image      | Text after conversion
------+----------------------------------------------------------------------+------------+----------------------
1     | (run module) --arv_dt=2013-02-24 --los=11 --guests=1                 | US$118.00  | U5511E.OO
2     | (run module) --arv_dt=2013-02-24 --prop_id=15 --los=11 --guests=1    | EUR 97.00  | â¬ 97.00
3     | (run module) --arv_dt=2013-02-24 --prop_id=16316 --los=11 --guests=1 | USD 789.75 | U55 7E9.75
4     | (run module) --arv_dt=2013-02-24 --prop_id=1553 --los=11 --guests=1  | US$ 84.15  | U55 EL15
      |                                                                      | US$ 96.82  | U55 96.52

(los = Length of Stay)

*Main Issues:*

It converts ‘*EUR*’ to ‘*â¬*’, and almost all other currencies are garbled as well.

It converts ‘*US$ 84.15*’ to ‘*U55 EL15*’: ‘*8*’ becomes ‘*E*’ and ‘*4.*’ becomes ‘*L*’.

It converts ‘*US$ 96.82*’ to ‘*U55 96.52*’: ‘*8*’ becomes ‘*5*’.

If anyone has encountered an issue like this, or knows of a more
flexible way to solve it, any help would be much appreciated.
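
For the first issue, the ‘â¬’ output looks like the UTF-8 bytes of the euro sign (E2 82 AC) displayed as Latin-1 characters, which would mean the symbol was read correctly and only the decoding step is off. The following is a minimal sketch under that assumption. It also assumes the OCR text arrives as raw UTF-8 bytes, which I have not verified for get_ocr; detect_currency and %CURRENCY_FOR are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical lookup table from currency symbol to currency code.
my %CURRENCY_FOR = (
    "\x{20AC}" => 'EUR',   # euro sign
    "\x{00A3}" => 'GBP',   # pound sign
    "\x{0E3F}" => 'THB',   # Thai baht sign
);

# Hypothetical helper: decode the OCR bytes as UTF-8 (assumption), then
# look for a known currency marker in the decoded text.
sub detect_currency {
    my ($ocr_bytes) = @_;
    my $text = decode('UTF-8', $ocr_bytes);

    # "US$" is often read as "U55" or "USS", so accept those variants too.
    return 'USD' if $text =~ /U[S5][\$5S]/;

    for my $symbol (keys %CURRENCY_FOR) {
        return $CURRENCY_FOR{$symbol} if index($text, $symbol) >= 0;
    }
    return;   # unknown currency
}

print detect_currency("\xE2\x82\xAC 97.00"), "\n";
```

Matching on decoded characters avoids having to write patterns against the garbled byte sequences themselves.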

Thanks
