Re: [Ankur-core] Bangla OCR progress
On 5/9/09, Debayan Banerjee debaya...@gmail.com wrote:
> 2009/5/9 Deepayan Sarkar deepayan.sar...@gmail.com:
>> Debayan, I have been meaning to ask you: is your character segmentation algorithm in a form that could be easily separated out?
>
> The segmentation algorithm can be found here (http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf)

But this is your original algorithm, which segmented গ etc. (at least for some fonts), isn't it? I thought you had an improved algorithm which works around some of those problems (or maybe I misunderstood your mail).

>> If it could be easily done, I would like to try it out in BOCRA. Unfortunately, I don't think I will have enough time in the near future to figure out how ocropus/tesseract does things.
>
> Kindly read the paragraph in this (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html) post regarding reducing the number of character classes to be trained. I want to know if this is possible using BOCRA.

No, it's not. From the beginning, my design for BOCRA was based on the idea of on-the-fly training, because that's the only approach I thought was feasible given the combination of non-standard fonts and so many potential conjuncts.

In most realistic examples, the number of conjuncts is actually quite limited. After accounting for the most common ones, the frequency of the rest is probably lower than the normal OCR error rate anyway.

-Deepayan

___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core
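The claim about conjunct frequency can be checked against any Unicode Bengali corpus. The following is only a minimal sketch under stated assumptions: a conjunct is taken to be a run of consonants joined by hasanta (U+09CD), and the names `count_conjuncts` and `coverage` are hypothetical helpers, not part of BOCRA or Tesseract.

```python
import re
from collections import Counter

HASANTA = "\u09cd"
NUKTA = "\u09bc"
# Assumption: consonants are U+0995..U+09B9 (unassigned gaps in the range are
# harmless), plus the precomposed letters U+09DC, U+09DD, U+09DF and the
# Assamese letters U+09F0, U+09F1.
CONSONANT = "[\u0995-\u09b9\u09dc\u09dd\u09df\u09f0\u09f1]" + NUKTA + "?"
CONJUNCT_RE = re.compile(CONSONANT + "(?:" + HASANTA + CONSONANT + ")+")


def count_conjuncts(text):
    """Count every consonant cluster joined by hasanta in the text."""
    return Counter(CONJUNCT_RE.findall(text))


def coverage(counts, top_n):
    """Fraction of all conjunct occurrences covered by the top_n most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


if __name__ == "__main__":
    sample = "পঞ্জাব সিন্ধু গুজরাট মরাঠা দ্রাবিড় উত্কল বঙ্গ বিন্ধ্য হিমাচল যমুনা গঙ্গা"
    counts = count_conjuncts(sample)
    for conjunct, n in counts.most_common():
        print(conjunct, n)
    print("top-3 coverage:", coverage(counts, 3))
```

Run over a large corpus, `coverage(counts, n)` for increasing `n` would show how quickly a small set of conjuncts accounts for most occurrences, which is the quantity the argument above depends on.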
Re: [Ankur-core] Bangla OCR progress
2009/4/20 srhaque srha...@theiet.org:
> BTW, if you still need my test file with conjunct samples, here it is...

Thank you very much. They have proved *very helpful* :) I prepared this (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html) post with the help of your document.

--
Regards,
Debayan Banerjee
Support Free Software
http://deeproot.in
Re: [Ankur-core] Bangla OCR progress
On Friday 08 May 2009, Debayan Banerjee wrote:
> 2009/4/20 srhaque srha...@theiet.org:
>> BTW, if you still need my test file with conjunct samples, here it is...
>
> Thank you very much. They have proved *very helpful* :) I prepared this (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html) post with the help of your document.

Cool. If it is of any use, then note that my Raga font also has glyphs for all the conjuncts (though I've not done anything with the advanced tables to refine the font generally).

I've been thinking about OCR for a little while too, and am doing some small experiments here and there based on applying brute force to simple algorithms for deskewing, text-block extraction and segmentation. However, I'm a bit stuck for inspiration on that front for now, so if there is anything I can do to help *you*, please let me know.

Thanks,
Shaheed
Re: [Ankur-core] Bangla OCR progress
Debayan, I have been meaning to ask you: is your character segmentation algorithm in a form that could be easily separated out? If it could be easily done, I would like to try it out in BOCRA. Unfortunately, I don't think I will have enough time in the near future to figure out how ocropus/tesseract does things.

-Deepayan
Re: [Ankur-core] Bangla OCR progress
BTW, if you still need my test file with conjunct samples, here it is... Copyright (c) 2007, 2008 S.R.Haque (srha...@theiet.org). Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled GNU Free Documentation License. CHANGE LOG 2007-04-16 S.R.Haque First released. 2008-01-06 S.R.Haque Adding missing conjuncts of ঈ. SAMPLE TEXT জনগণমন-অধিনায়ক জয় হে ভারতভাগ্যবিধাতা! পঞ্জাব সিন্ধু গুজরাট মরাঠা দ্রাবিড় উত্কল বঙ্গ বিন্ধ্য হিমাচল যমুনা গঙ্গা উচ্ছলজলধিতরঙ্গ তব শুভ নামে জাগে, তব শুভ আশিস মাগে, গাহে তব জয়গাথা। জনগণমঙ্গলদায়ক জয় হে ভারতভাগ্যবিধাতা! জয় হে, জয় হে, জয় হে, জয় জয় জয়, জয় হে॥ জনগণমন-অধিনায়ক জয় হে ভারতভাগ্যবিধাতা! UNICODE 5.0 BENGALI CHARACTER CODES 098x ঁ ং ঃ অ আ ই ঈ উ ঊ ঋ ঌ এ 099x ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট 09Ax ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য 09Bx র ল শ ষ স হ ় ঽ া ি 09Cx ী ু ূ ৃ ৄ ে ৈ ো ৌ ্ ৎ ৗ ড় ঢ় য় 09Dx ৠ ৡ ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ ৰ ৱ ৲ ৳ ৴ ৵ ৶ ৷ ৸ ৹ ৺ ASCII 0020 ! # $ % ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; = ? 
0030 @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ 0040 ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ PUNCTUATION AND SPECIAL SYMBOLS ৢ ৣ ‘ ’ “ ” VOWEL CONJUNCTS Conjuncts of আ (া) কা খা গা ঘা ঙা চা ছা জা ঝা ঞা টা ঠা ডা ঢা ণা তা থা দা ধা না পা ফা বা ভা মা যা রা লা শা ষা সা হা ড়া ঢ়া য়া ৰা ৱা Conjuncts of ই (ি) কি খি গি ঘি ঙি চি ছি জি ঝি ঞি টি ঠি ডি ঢি ণি তি থি দি ধি নি পি ফি বি ভি মি যি রি লি শি ষি সি হি ড়ি ঢ়ি য়ি ৰি ৱি Conjuncts of ঈ (ী) কী খী গী ঘী ঙী চী ছী জী ঝী ঞী টী ঠী ডী ঢী ণী তী থী দী ধী নী পী ফী বী ভী মী যী রী লী শী ষী সী হী ড়ী ঢ়ী য়ী ৰী ৱী Conjuncts of উ (ু) কু খু গু ঘু ঙু চু ছু জু ঝু ঞু টু ঠু ডু ঢু ণু তু থু দু ধু নু পু ফু বু ভু মু যু রু লু শু ষু সু হু ড়ু ঢ়ু য়ু ৰু ৱু Conjuncts of ঊ (ূ) কূ খূ গূ ঘূ ঙূ চূ ছূ জূ ঝূ ঞূ টূ ঠূ ডূ ঢূ ণূ তূ থূ দূ ধূ নূ পূ ফূ বূ ভূ মূ যূ রূ লূ শূ ষূ সূ হূ ড়ূ ঢ়ূ য়ূ ৰূ ৱূ Conjuncts of ঋ (ৃ) কৃ খৃ গৃ ঘৃ ঙৃ চৃ ছৃ জৃ ঝৃ ঞৃ টৃ ঠৃ ডৃ ঢৃ ণৃ তৃ থৃ দৃ ধৃ নৃ পৃ ফৃ বৃ ভৃ মৃ যৃ রৃ লৃ শৃ ষৃ সৃ হৃ ড়ৃ ঢ়ৃ য়ৃ ৰৃ ৱৃ Conjuncts of ঌ (ৄ) কৄ খৄ গৄ ঘৄ ঙৄ চৄ ছৄ জৄ ঝৄ ঞৄ টৄ ঠৄ ডৄ ঢৄ ণৄ তৄ থৄ দৄ ধৄ নৄ পৄ ফৄ বৄ ভৄ মৄ যৄ রৄ লৄ শৄ ষৄ সৄ হৄ ড়ৄ ঢ়ৄ য়ৄ ৰৄ ৱৄ Conjuncts of এ (ে) কে খে গে ঘে ঙে চে ছে জে ঝে ঞে টে ঠে ডে ঢে ণে তে থে দে ধে নে পে ফে বে ভে মে যে রে লে শে ষে সে হে ড়ে ঢ়ে য়ে ৰে ৱে Conjuncts of ঐ (ৈ) কৈ খৈ গৈ ঘৈ ঙৈ চৈ ছৈ জৈ ঝৈ ঞৈ টৈ ঠৈ ডৈ ঢৈ ণৈ তৈ থৈ দৈ ধৈ নৈ পৈ ফৈ বৈ ভৈ মৈ যৈ রৈ লৈ শৈ ষৈ সৈ হৈ ড়ৈ ঢ়ৈ য়ৈ ৰৈ ৱৈ Conjuncts of ও (ো) কো খো গো ঘো ঙো চো ছো জো ঝো ঞো টো ঠো ডো ঢো ণো তো থো দো ধো নো পো ফো বো ভো মো যো রো লো শো ষো সো হো ড়ো ঢ়ো য়ো ৰো ৱো Conjuncts of ঔ (ৌ) কৌ খৌ গৌ ঘৌ ঙৌ চৌ ছৌ জৌ ঝৌ ঞৌ টৌ ঠৌ ডৌ ঢৌ ণৌ তৌ থৌ দৌ ধৌ নৌ পৌ ফৌ বৌ ভৌ মৌ যৌ রৌ লৌ শৌ ষৌ সৌ হৌ ড়ৌ ঢ়ৌ য়ৌ ৰৌ ৱৌ CONSONANT CONJUNCTS With Hasanta (্) ক্ খ্ গ্ ঘ্ ঙ্ চ্ ছ্ জ্ ঝ্ ঞ্ ট্ ঠ্ ড্ ঢ্ ণ্ ত্ থ্ দ্ ধ্ ন্ প্ ফ্ ব্ ভ্ ম্ য্ র্ ল্ শ্ ষ্ স্ হ্ ড়্ ঢ়্ য়্ ৰ্
Re: [Ankur-core] Bangla OCR progress
I take the liberty of top posting since I copied the mail's contents from the archives, and bottom posting would require messing with the text below too much.

In reply to this particular line:

> It takes the old matra removal approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not).

Kindly see http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690.

Below is the original conversation.

On 7/2/08, Golam Mortuza Hossain [EMAIL PROTECTED] wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
>> This guy seems to be doing some interesting progress for a Bangla OCR - or more precisely, enabling Bangla in Tesseract.
>> http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it sounded like a good base. It's nice that someone's actually doing something with it.

It takes the old matra removal approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not). On the other hand, having something that works even partly is a good start.

> Yes, it looks definitely interesting.
>
>> Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness of Bengali fonts. Since it has all letters and conjuncts typed in, the file might be useful for training Tesseract as well.
>
> Deepayan should be able to give some input here. He has working experience with R and may have some training sample as well.

Well, we have a bunch of unicode documents. For some of them, I have print versions too, and can scan them if needed. A simpler approach would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need to be produced using tesseract. I haven't installed tesseract yet, and will try, but let me know if anyone else manages.

-Deepayan

--
Be Intelligent, Use GNU/Linux
http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in
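The over-segmentation problem quoted above can be reproduced on a toy bitmap. What follows is a minimal sketch of the generic matra-removal idea only, under the assumption that the matra is the row with the most ink and that characters are separated by ink-free columns once that row is blanked. It is not the code used in Tesseract Indic or BOCRA, and the function names are hypothetical.

```python
def find_matra_rows(img, tolerance=0.9):
    """Return indices of rows whose ink count is within tolerance of the peak."""
    row_sums = [sum(row) for row in img]
    peak = max(row_sums)
    if peak == 0:
        return []
    return [y for y, s in enumerate(row_sums) if s >= tolerance * peak]


def segment_columns(img, matra_rows):
    """Ignore the matra rows, then split wherever a column has no ink."""
    skip = set(matra_rows)
    width = len(img[0])
    col_sums = [
        sum(row[x] for y, row in enumerate(img) if y not in skip)
        for x in range(width)
    ]
    segments, start = [], None
    for x, s in enumerate(col_sums):
        if s > 0 and start is None:
            start = x                         # a new run of inked columns begins
        elif s == 0 and start is not None:
            segments.append((start, x - 1))   # the run ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


# Toy word: row 0 is the matra. The left glyph (columns 0-2) has two strokes
# joined only by the matra; the right glyph (columns 6-8) is also joined at
# the bottom.
WORD = [
    "111111111",
    "101000101",
    "101000101",
    "101000111",
]
img = [[int(c) for c in row] for row in WORD]

if __name__ == "__main__":
    matra = find_matra_rows(img)
    print(matra)
    print(segment_columns(img, matra))
```

In this toy case the left glyph is cut into two pieces because its strokes touch only through the matra, while the right glyph stays whole. That is the same kind of failure as the গ example, and it suggests why any scheme that relies purely on ink-free columns needs extra handling for such shapes.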
Re: [Ankur-core] Bangla OCR progress
On Apr 19, 2009, at 5:16 AM, Debayan Banerjee wrote:
> I take the liberty of top posting since i copied the mail's contents from archives and bottom posting will require messing with the text below to much.
>
> In reply to this particular line:
>
>> It takes the old matra removal approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not).
>
> Kindly see http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690 .
>
> Below is the original conversation.
>
> On 7/2/08, Golam Mortuza Hossain [EMAIL PROTECTED] wrote:
>> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
>>> This guy seems to be doing some interesting progress for a Bangla OCR - or more precisely, enabling Bangla in Tesseract.
>>> http://debayanin.googlepages.com/hackingtesseract
>
> Cool. I had some interaction with the tesseract/ocropus folks, and it sounded like a good base. It's nice that someone's actually doing something with it.
>
> It takes the old matra removal approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not). On the other hand, having something that works even partly is a good start.
>
>> Yes, it looks definitely interesting.
>>
>>> Looks like he needs some more training data - can we provide him with some?
>>
>> If I remember correctly, there was a sample file for testing completeness of Bengali fonts. Since it has all letters and conjuncts typed in, the file might be useful for training Tesseract as well.
>>
>> Deepayan should be able to give some input here. He has working experience with R and may have some training sample as well.
>
> Well, we have a bunch of unicode documents. For some of them, I have print versions too, and can scan them if needed. A simpler approach would be to render them using different fonts and take screenshots.
>
> Apparently he also needs some box-files, whatever they are, which need to be produced using tesseract. I haven't installed tesseract yet, and will try, but let me know if anyone else manages.
>
> -Deepayan

Dear all,

I have been working on OCR for my university. I took most of the ideas from bocra.sourceforge.net. It is written in C++ using the GraphicsMagick library.

Any suggestions from you about matching the alphabet would be welcome.

Here is my progress: http://picasaweb.google.com/salahuddin66/OCR#

regards
salahuddin
salahuddin66.blogspot.com
Re: [Ankur-core] Bangla OCR progress
Sayamindu Dasgupta wrote:
| This guy seems to be doing some interesting progress for a Bangla OCR
| - or more precisely, enabling Bangla in Tesseract.
| http://debayanin.googlepages.com/hackingtesseract
| Looks like he needs some more training data - can we provide him with some ?

As an aside, he is working with the Swatantra Malayalam Computing group to fix OCR issues in ml_IN too. And, I'd request someone to validate how much progress he is making in terms of attaining accuracy.

--
You see things; and you say 'Why?'; But I dream things that never were; and I say 'Why not?' - George Bernard Shaw
www.linkedin.com/in/sankarshan
Re: [Ankur-core] Bangla OCR progress
On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
> This guy seems to be doing some interesting progress for a Bangla OCR - or more precisely, enabling Bangla in Tesseract.
> http://debayanin.googlepages.com/hackingtesseract

Yes, it looks definitely interesting.

> Looks like he needs some more training data - can we provide him with some?

If I remember correctly, there was a sample file for testing completeness of Bengali fonts. Since it has all letters and conjuncts typed in, the file might be useful for training Tesseract as well.

Deepayan should be able to give some input here. He has working experience with R and may have some training sample as well.

Cheers,
Golam

--
http://gravity.psu.edu/~hossain/
Re: [Ankur-core] Bangla OCR progress
On 7/2/08, Golam Mortuza Hossain [EMAIL PROTECTED] wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
>> This guy seems to be doing some interesting progress for a Bangla OCR - or more precisely, enabling Bangla in Tesseract.
>> http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it sounded like a good base. It's nice that someone's actually doing something with it.

It takes the old matra removal approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not). On the other hand, having something that works even partly is a good start.

> Yes, it looks definitely interesting.
>
>> Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness of Bengali fonts. Since it has all letters and conjuncts typed in, the file might be useful for training Tesseract as well.
>
> Deepayan should be able to give some input here. He has working experience with R and may have some training sample as well.

Well, we have a bunch of unicode documents. For some of them, I have print versions too, and can scan them if needed. A simpler approach would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need to be produced using tesseract. I haven't installed tesseract yet, and will try, but let me know if anyone else manages.

-Deepayan
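On the box-file question: as far as I know, a Tesseract box file is a plain-text file with one line per character, giving the symbol followed by the left, bottom, right and top pixel coordinates of its bounding box, measured from the bottom-left corner of the image. Later Tesseract versions append a page number as a sixth field. The sketch below parses such lines on that assumption; `Box`, `parse_box_line` and `to_top_left` are hypothetical names for illustration, not Tesseract APIs.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'symbol left bottom right top [page]' into a Box."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 fields, got %d" % len(parts))
    left, bottom, right, top = (int(p) for p in parts[1:5])
    # Assumption: a missing page field means page 0.
    page = int(parts[5]) if len(parts) == 6 else 0
    return Box(parts[0], left, bottom, right, top, page)


def to_top_left(box, image_height):
    """Convert to (x, y, width, height) with the origin at the top-left."""
    return (
        box.left,
        image_height - box.top,
        box.right - box.left,
        box.top - box.bottom,
    )


if __name__ == "__main__":
    box = parse_box_line("\u0997 10 20 30 50")
    print(box)
    print(to_top_left(box, image_height=100))
```

The coordinate conversion matters because most image libraries put the origin at the top-left, so boxes have to be flipped vertically before they can be drawn over a scanned page for checking.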