Sir,
I have gone through most of the OCR technologies and reached a conclusion: I
would like to use Apache Tika and JavaOCR for this purpose.

Tesseract is an OCR tool that can extract text in multiple
languages. It is implemented in C++, so it can be accessed from Java through
native calls. There is also a Java wrapper, Tess4J, but reviews say that it has
many bugs.
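
One way to try Tesseract from Java without writing any native bindings is to
launch the tesseract command-line program as a child process. The following is
only a minimal sketch of that alternative, not the native-call route mentioned
above. It assumes a tesseract binary on the PATH that accepts "stdout" as its
output base (this depends on the installed version) and Java 9 or later for
readAllBytes. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the tesseract command-line tool on one image.
final class TesseractCli {

    // Builds the argument list: tesseract <image> stdout -l <language>
    static List<String> buildCommand(Path image, String language) {
        return Arrays.asList("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs tesseract and returns the recognized text.
    // Assumes the binary is installed and writes UTF-8 text to standard output.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, language))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tool errors on our stderr
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with status " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

A call such as TesseractCli.recognize(Paths.get("scan.png"), "eng") would then
return the recognized text, provided the English language data is installed.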

Apache Tika is developed in Java. It can extract text
from .xls, Word, .txt, PDF and many other formats. It should also be easy to
integrate into the project; so far I have only briefly looked at how it is used.

As for JavaOCR, it is good at extracting text from JPEG or scanned
images. We can train it with various fonts: the more we train it, the better
its accuracy, but its speed decreases. I did not find any specific
documentation for it.
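
The split proposed here (Tika for document formats, JavaOCR for images) could
be sketched as a small routing step that runs before the text reaches cTAKES.
This is only an illustration under assumptions: the class name, the enum, and
the list of image extensions are hypothetical, and the actual calls into Tika
or JavaOCR are left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical router: decides which extraction backend should handle a file.
final class InputRouter {

    enum Backend { TIKA, OCR }

    // Assumed set of image extensions that should go to the OCR engine.
    private static final Set<String> IMAGE_EXTENSIONS = new HashSet<>(
            Arrays.asList("jpg", "jpeg", "png", "tif", "tiff", "bmp", "gif"));

    // Picks a backend from the file extension (case-insensitive).
    // Anything that is not a known image type, including files without
    // an extension, falls through to the document-format backend.
    static Backend backendFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTENSIONS.contains(extension) ? Backend.OCR : Backend.TIKA;
    }
}
```

Routing on the extension keeps the sketch simple; a real preprocessing step
would more likely rely on content-type detection than on file names.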



On Sun, Jul 14, 2013 at 9:18 PM, sandeep rg <sandeep.f...@gmail.com> wrote:

> Thanks a lot to both of you for your support. I will do my best to find a
> solution for the JIRA problem. I will share the proposal with both of you.
>
>
>
> On Sun, Jul 14, 2013 at 1:46 AM, Chen, Pei <pei.c...@childrens.harvard.edu
> > wrote:
>
>> Sandeep,
>> It's great to have Chris on board as well; he was one of the coordinators
>> of GSoC.
>> Looking forward to it.
>>
>> Sent from my iPhone
>>
>> On Jul 13, 2013, at 12:24 PM, "Mattmann, Chris A (398J)" <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>> > Hi Sandeep,
>> >
>> > That is great news, and good job. OK, for some ideas about developing
>> > your proposal, you may want to simply start with a Google Docs, and then
>> > share it with Pei. I'd be happy to help co-mentor if Pei and you think
>> > it's useful too.
>> >
>> > Your proposal should likely cover:
>> >
>> > 1. Background - what's the state of CTAKES-189 and what's it trying to
>> > accomplish
>> >  (include some figures, etc. along with your text)
>> >
>> > 2. Approach - what are you going to do to solve CTAKES-189. Be specific,
>> >  and try to break it down into smaller, easily reversible steps
>> >
>> > 3. Schedule - how long and what is the schedule for achieving this?
>> >
>> > 4. Risks/etc. - any known risks like are you taking a vacation anytime
>> >  soon :) or are there other time constraints?
>> >
>> > 5. References, etc.
>> >
>> > HTH and I'd be happy if you want to share the GDocs with me as you
>> > develop it.
>> >
>> > Cheers!
>> >
>> > Chris
>> >
>> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > Chris Mattmann, Ph.D.
>> > Senior Computer Scientist
>> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > Office: 171-266B, Mailstop: 171-246
>> > Email: chris.a.mattm...@nasa.gov
>> > WWW:  http://sunset.usc.edu/~mattmann/
>> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > Adjunct Assistant Professor, Computer Science Department
>> > University of Southern California, Los Angeles, CA 90089 USA
>> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >
>> >
>> >
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: sandeep rg <sandeep.f...@gmail.com>
>> > Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> > Date: Saturday, July 13, 2013 8:57 AM
>> > To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> > Subject: Re: to involve in your development group
>> >
>> >> I have also gone through the technologies available for developing
>> >> OCR. Of those, I think Apache Tika and Tesseract are the best for
>> >> resolving the problem.
>> >>
>> >>
>> >> On Sat, Jul 13, 2013 at 9:02 PM, sandeep rg <sandeep.f...@gmail.com>
>> >> wrote:
>> >>
>> >>> Hi Chris Mattmann,
>> >>> I participated in the event coordinated by Luciano Resende:
>> >>>
>> >>> http://community.apache.org/mentoringprogramme-icfoss-pilot.html
>> >>>
>> >>> From that I learned about open source, and I would like to work on your
>> >>> project, cTAKES. I would like to fix the JIRA issue
>> >>>
>> >>> https://issues.apache.org/jira/browse/CTAKES-189
>> >>>
>> >>> Chen Pei accepted my request to be my mentor. Now I want to submit a
>> >>> proposal to Apache about the project I am going to work on. Can you
>> >>> help me prepare a proposal to be submitted before the 18th of this July?
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Jul 13, 2013 at 2:26 AM, Mattmann, Chris A (398J) <
>> >>> chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>>
>> >>>> Hi Sandeep,
>> >>>>
>> >>>> I think the best thing to do is:
>> >>>>
>> >>>> 1. Develop a JIRA issue here:
>> >>>> https://issues.apache.org/jira/browse/CTAKES
>> >>>> 1a. you can register for a new account on JIRA
>> >>>> 2. Once your JIRA issue is created, feel free to start a [DISCUSS]
>> >>>> thread (e.g., with subject [DISCUSS] "some topic" where "some topic" is
>> >>>> perhaps the main idea you have) on dev@ctakes.apache.org, referencing
>> >>>> your issue and asking for feedback
>> >>>> 3. Work with the Apache cTAKES PMC and committers to get your patches
>> >>>> and other items attached to your issue from #1 committed into the sources
>> >>>>
>> >>>> Ideally if 1-3 happen and it's a good interaction, Apache is built on
>> >>>> meritocracy and you could possibly earn the merit to become a PMC
>> >>>> member or committer on the project.
>> >>>>
>> >>>> Cheers,
>> >>>> Chris
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: sandeep rg <sandeep.f...@gmail.com>
>> >>>> Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> >>>> Date: Thursday, July 11, 2013 11:30 AM
>> >>>> To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> >>>> Subject: Re: to involve in your development group
>> >>>>
>> >>>>> Can you tell me which details I should include in a proposal?
>> >>>>> Should I include all of the implementation (technical) details in the
>> >>>>> proposal?
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Jul 11, 2013 at 9:45 PM, Mattmann, Chris A (398J) <
>> >>>>> chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>>>>
>> >>>>>> Dear Sandeep,
>> >>>>>>
>> >>>>>> Thanks for your interest in cTAKES. We would welcome your
>> >>>>>> contribution and are happy to have your interest in the project.
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>> Chris
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: sandeep rg <sandeep.f...@gmail.com>
>> >>>>>> Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> >>>>>> Date: Wednesday, July 10, 2013 11:01 AM
>> >>>>>> To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>> >>>>>> Subject: Re: to involve in your development group
>> >>>>>>
>> >>>>>>> Sir,
>> >>>>>>>
>> >>>>>>> My name is Sandeep RG. I am a B.Tech graduate in computer science, now
>> >>>>>>> doing an internship at a company, working in Java.
>> >>>>>>>
>> >>>>>>> I have installed everything successfully and am now downloading the
>> >>>>>>> resources. It takes a long time.
>> >>>>>>>
>> >>>>>>> I have gone through the suggested OCR technologies.
>> >>>>>>> JavaOCR has some good user reviews.
>> >>>>>>> Apache Tika can process many different file formats.
>> >>>>>>> In addition, there is Tesseract, which is also used for OCR.
>> >>>>>>> Apache PDFBox is also used for text extraction, but only for PDF
>> >>>>>>> files.
>> >>>>>>> I am now going through everything to find the best technology among
>> >>>>>>> these.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Jul 10, 2013 at 12:52 AM, Chen, Pei
>> >>>>>>> <pei.c...@childrens.harvard.edu>wrote:
>> >>>>>>>
>> >>>>>>>> Hi Sandeep,
>> >>>>>>>> I am delighted to work with you on this project.
>> >>>>>>>>
>> >>>>>>>> I was not sure if I understood you correctly: did you mean to say
>> >>>>>>>> that you have already tried using cTAKES and its components?
>> >>>>>>>> If not, you can do an svn checkout of the code and try running the
>> >>>>>>>> debugger GUI from the command line (or Eclipse IDE), which will allow
>> >>>>>>>> you to type in plain text and get back the different structured
>> >>>>>>>> content (types) that cTAKES produces:
>> >>>>>>>> MAVEN_OPTS="-Xmx2g -Xms1g"
>> >>>>>>>> mvn -PrunCVD compile
>> >>>>>>>> From the guide:
>> >>>>>>>> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+Developer+Install+Guide
>> >>>>>>>>
>> >>>>>>>> A bit of background:
>> >>>>>>>> Apache cTAKES uses SVN for version control:
>> >>>>>>>> https://svn.apache.org/repos/asf/ctakes/trunk/
>> >>>>>>>> Jira for issues tracking:
>> >>>>>>>> https://issues.apache.org/jira/browse/ctakes
>> >>>>>>>> Maven for building and dependency management.
>> >>>>>>>> A lot of the developers use Eclipse IDE for their development.
>> >>>>>>>> More info on ctakes.apache.org
>> >>>>>>>>
>> >>>>>>>> cTAKES is built on top of the Apache UIMA framework. Essentially,
>> >>>>>>>> cTAKES is a collection of Annotators (Java classes) wired together
>> >>>>>>>> into a pipeline.
>> >>>>>>>> Its goal, in a nutshell, is to turn unstructured plain text into a
>> >>>>>>>> structured/normalized form, and it is specially trained for medical
>> >>>>>>>> notes.
>> >>>>>>>> Right now, the input cTAKES expects is plain text, and cTAKES does
>> >>>>>>>> not have an OCR component.
>> >>>>>>>> CTAKES-189 ("GSoC: implement OCR/tika to standardize text inputs")
>> >>>>>>>> was an idea to allow cTAKES to take in any type of input (PDF, images,
>> >>>>>>>> Word, XLS, etc.) and pass the text on for cTAKES processing.
>> >>>>>>>> [I was originally thinking this could be done in some kind of
>> >>>>>>>> preprocessing, or an optional Annotator that could be added at the
>> >>>>>>>> beginning of a pipeline.]  There may be some existing work that
>> >>>>>>>> could potentially be reused: Apache Tika (
>> >>>>>>>> https://issues.apache.org/jira/browse/TIKA-93 ) as well as some open
>> >>>>>>>> source OCR toolkits (JavaOCR).
>> >>>>>>>>
>> >>>>>>>> About Me:
>> >>>>>>>> http://childrenshospital.org/cfapps/research/data_admin/Site3240/mainpageS3240P8.html
>> >>>>>>>> http://www.linkedin.com/in/peistation
>> >>>>>>>> http://people.apache.org/committer-index.html#chenpei
>> >>>>>>>>
>> >>>>>>>>> -----Original Message-----
>> >>>>>>>>> From: sandeep rg [mailto:sandeep.f...@gmail.com]
>> >>>>>>>>> Sent: Tuesday, July 09, 2013 1:19 PM
>> >>>>>>>>> To: dev@ctakes.apache.org
>> >>>>>>>>> Subject: Re: to involve in your development group
>> >>>>>>>>>
>> >>>>>>>>> Thanks a lot for giving me support. I would like to work with you.
>> >>>>>>>>>
>> >>>>>>>>> I have gone through the objectives of the software, used the
>> >>>>>>>>> software, and gone through various components of the project. Can you
>> >>>>>>>>> give me a starting point from which I can learn more about the coding
>> >>>>>>>>> part of the project?
>> >>>>>>>>>
>> >>>>>>>>> Can you tell me more about the project and about you also?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Jul 9, 2013 at 1:14 AM, Chen, Pei
>> >>>>>>>>> <pei.c...@childrens.harvard.edu>wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Sandeep,
>> >>>>>>>>>> Thank you for the interest.  I just had a quick look at the ICFOSS
>> >>>>>>>>>> pilot mentoring program and will be happy to serve as a mentor for
>> >>>>>>>>>> your project proposal(s) if you are interested.
>> >>>>>>>>>>
>> >>>>>>>>>> --Pei
>> >>>>>>>>>>
>> >>>>>>>>>>> -----Original Message-----
>> >>>>>>>>>>> From: sandeep rg [mailto:sandeep.f...@gmail.com]
>> >>>>>>>>>>> Sent: Monday, July 08, 2013 2:24 PM
>> >>>>>>>>>>> To: dev@ctakes.apache.org
>> >>>>>>>>>>> Subject: Re: to involve in your development group
>> >>>>>>>>>>>
>> >>>>>>>>>>> Sir,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Details of the pilot mentoring programme with ICFOSS, India, are
>> >>>>>>>>>>> given at the web address below:
>> >>>>>>>>>>> http://community.apache.org/mentoringprogramme-icfoss-pilot.html
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I am new to this community, so I need a mentor for the project. It
>> >>>>>>>>>>> would be very helpful for me.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Mon, Jul 8, 2013 at 7:22 PM, Chen, Pei
>> >>>>>>>>>>> <pei.c...@childrens.harvard.edu>wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi Sandeep,
>> >>>>>>>>>>>> Welcome!  I am not familiar with the details of icfoss-apache, but
>> >>>>>>>>>>>> please, you are more than welcome to work on the code, and
>> >>>>>>>>>>>> contributions will be greatly appreciated!
>> >>>>>>>>>>>> There may be a learning curve, but feel free to let us know if you
>> >>>>>>>>>>>> have any questions/issues.
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Pei
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> -----Original Message-----
>> >>>>>>>>>>>>> From: sandeep rg [mailto:sandeep.f...@gmail.com]
>> >>>>>>>>>>>>> Sent: Saturday, July 06, 2013 11:50 AM
>> >>>>>>>>>>>>> To: dev@ctakes.apache.org
>> >>>>>>>>>>>>> Subject: to involve in your development group
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> My name is Sandeep. I am a B.Tech graduate. I participated in a
>> >>>>>>>>>>>>> camp held in Kerala, India, in association with icfoss-apache,
>> >>>>>>>>>>>>> called the youth mentoring programme and coordinated by Luciano
>> >>>>>>>>>>>>> Resende.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I like the project and would like to get involved in it as a
>> >>>>>>>>>>>>> programmer. I have gone through your project and its bug list.
>> >>>>>>>>>>>>> I would like to work on the bug
>> >>>>>>>>>>>>> "CTAKES-189: GSoC: implement OCR/tika to standardize text inputs
>> >>>>>>>>>>>>> for cTAKES". Can you allow me to work on that?
>> >
>>
>
>
