Hi All, I currently have Tesseract implemented within a Perl module via the import "use Image::OCR::Tesseract 'get_ocr';". The Perl module is designed to do web scraping, specifically to scrape data from the website www.hoteltravel.com. Tesseract comes into play when extracting rates from the rate breakup (the per-day price breakdown of a room). The website stores its room pricing data in a PNG image file, and Tesseract is used to extract the text from that image.
The issue I'm currently facing is that these rates and currencies are converted incorrectly when the image is turned into text using 'get_ocr'. The sample code below shows how I'm using Tesseract and how I clean up the extracted rates. Further specifics on the issue follow the sample code.

    $agent->save_content("/tmp/rate_img.png");    # Saving the content in temporary file
    my $rate_img = get_ocr('/tmp/rate_img.png');  # Convert image to text
    system ("rm /tmp/rate_img.png");              # Deleting the temporary file
    my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is;  # Extract Rate from text

    ##############################
    # Perform Cleanup of Extracted Rate
    ##############################
    ($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
    ($rate_per_day) = $rate_img =~ m!U\d{2}|\$\s*([\d\,\.]+)!is if (!$rate_per_day);
    ($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
    ($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
    $rate_per_day =~ s!E!8!isg;
    $rate_per_day =~ s!L!4.!isg;
    $rate_per_day =~ s!\,!!sg;

    $rate_img = &make_ascii_text($rate_img, 'utf-8');
    $currency = 'EUR' if ($rate_img =~ m!ae?!is);
    $currency = 'THB' if ($rate_img =~ m!as!is);
    $currency = 'USD' if ($rate_img =~ m!USS!is);
    $currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
    $currency = 'GBP' if ($rate_img =~ m!a?!is);

S. No  Criteria (run module)                                      Image       Text after conversion
1.     --arv_dt=2013-02-24 --los=11 --guests=1                    US$118.00   U5511E.OO
2.     --arv_dt=2013-02-24 --prop_id=15 --los=11 --guests=1       EUR 97.00   ⬠97.00
3.     --arv_dt=2013-02-24 --prop_id=16316 --los=11 --guests=1    USD 789.75  U55 7E9.75
4.     --arv_dt=2013-02-24 --prop_id=1553 --los=11 --guests=1     US$ 84.15   U55 EL15
                                                                  US$ 96.82   U55 96.52

(los = length of stay)

Main issues:

- It converts ‘EUR’ to ‘â¬’, and this kind of change happens for almost all currencies.
- It converts ‘US$ 84.15’ to ‘U55 EL15’; here ‘8’ becomes ‘E’ and ‘4.’ becomes ‘L’.
- It converts ‘US$ 96.82’ to ‘U55 96.52’; here ‘8’ becomes ‘5’.

If anyone has encountered an issue like this, or knows of a more flexible way to solve it, any help would be much appreciated.

Thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
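
For reference, the cleanup above could perhaps be folded into one lookup-based routine instead of a chain of fallback regexes. The following is only a sketch under assumptions: the name parse_ocr_rate and the %DIGIT_FIX table are hypothetical, and the confusion pairs (O for 0, E for 8, L for "4.", and "U55"/"USS" for "US$") are taken solely from the samples in this post, so other images may need other entries. It cannot repair a digit misread as another digit, such as the 8 read as 5 in "96.52".

```perl
use strict;
use warnings;

# Assumed confusion table, drawn only from the samples in this post.
my %DIGIT_FIX = (
    'O' => '0',
    'E' => '8',
    'L' => '4.',
);

# Hypothetical helper: returns (currency, amount) or an empty list.
sub parse_ocr_rate {
    my ($text) = @_;
    return unless defined $text;

    # "US$" shows up as "U55" or "USS", sometimes with no space before the amount.
    my ($prefix, $raw) =
        $text =~ m/^\s*(US\$|USD|U55|USS|EUR|GBP|THB)\s*([0-9OEL.,]+)/
        or return;
    my $currency = $prefix =~ /^U/ ? 'USD' : $prefix;

    # Map letters misread for digits back, then drop thousands separators.
    (my $amount = $raw) =~ s/([OEL])/$DIGIT_FIX{$1}/g;
    $amount =~ tr/,//d;

    # Reject anything that still does not look like a plain decimal number.
    return unless $amount =~ /^\d+(?:\.\d+)?$/;
    return ($currency, $amount);
}

for my $sample ('U5511E.OO', 'U55 7E9.75', 'U55 EL15') {
    my ($currency, $amount) = parse_ocr_rate($sample);
    print "$sample => $currency $amount\n";
}
```

Keeping the substitutions in one table makes it easier to see which confusions are handled, and the final numeric check avoids silently storing a rate that is still garbled.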