Hi Karl,

I will fix that encoding issue.
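
A minimal sketch of what such a fix could look like, assuming the bytes are the utf-8 output of the Tika transformer as discussed in the thread. The names Utf8Decoding, decodeUtf8 and readUtf8, and the maxChars cap, are hypothetical placeholders for illustration and are not taken from the connector code. The first method only pins the charset; the second also reads from a stream with an upper bound, which addresses the memory concern raised earlier in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Decoding {

    // Decode with an explicit charset instead of the platform default,
    // so the result does not depend on the machine's locale settings.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Read at most maxChars characters from a utf-8 stream, so memory use
    // is bounded regardless of document size. The cap is an assumption;
    // note that a hard cut could split a surrogate pair at the boundary.
    public static String readUtf8(InputStream in, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            while (sb.length() < maxChars) {
                int want = Math.min(buf.length, maxChars - sb.length());
                int n = reader.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Whether a character cap or a fully streaming tokenizer is the better fit would depend on how the text is handed to OpenNLP, so this is only a starting point.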

Thanks,
Chalitha

On Thu, Nov 19, 2015 at 12:31 PM, Karl Wright <[email protected]> wrote:

> Hi Chalitha,
>
> My comment was about encoding, not about languages. If you are assuming
> that the binary document stream is utf-8 (which will be the output of the
> Tika transformer), then you *must* specify utf-8 as the encoding when you
> convert it back to a string. Otherwise you will have data corruption.
>
> Thanks,
> Karl
>
>
> On Thu, Nov 19, 2015 at 12:45 AM, chalitha udara Perera
> <[email protected]> wrote:
>
> > Hi guys,
> >
> > Thank you very much for comments and suggestions !
> >
> > As Alessandro said, I have assumed the use of Tika connector prior to
> > using the OpenNLP connector.
> > I think it is a valid assumption because tika parses different sources
> > in to common format, so the future transformation connectors can largely
> > benefit from the use of tika in the connectors chain.
> >
> > Regarding the language issue, currently I implemented it to work with
> > English language content.
> > But I agree with Alessandro and connector can be made to support
> > different languages. Currently OpenNLP has following models [1]. NER
> > models are available for en, es and nl languages. but it is possible to
> > train models for other languages as well.
> >
> > Tika can be used to detect language from document (As far as I know
> > Stanbol does that), If we assumed the use of tika connector before
> > OpenNLP connector, we can use language to direct to correct model. In
> > this case we have to download all the NER models and reference them in
> > code.
> > Please give your suggestions on how language support should be included
> > in the OpenNLP connector
> >
> > Thanks,
> > Chalitha
> >
> > [1] http://opennlp.sourceforge.net/models-1.5/
> >
> > On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <[email protected]> wrote:
> >
> > > There's another problem with:
> > >
> > > String textContent = new String(bytes);
> > >
> > > Specifically, (1) its operation will vary with the locale of the
> > > machine it's being run on, and (2) there's no limit to the amount of
> > > memory that this could conceivably require. Both are problems. If you
> > > could use a stream you would be much better off.
> > >
> > > Karl
> > >
> > >
> > > On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti
> > > <[email protected]> wrote:
> > >
> > > > Hey Chal,
> > > > First of all thanks you very much for the contribution!
> > > > I have some observations :
> > > >
> > > > *Model Downloading*
> > > >
> > > > Taking the look to the way you provide the user with the models, I
> > > > can see there is a shell script to download very specific english
> > > > models.
> > > > It would be great having the possibility to configure the model to
> > > > use in the connector config UI .
> > > > In particular I see two possibilities :
> > > > 1) you provide a select list per model required and then
> > > > automatically you download the model and install it
> > > > 2) you provide the user with the possibility of uploading the model
> > > > he/she wants to use ( more flexible, but the user will need to
> > > > download a model on his own)
> > > > In my opinion is really important to keep the transformation
> > > > connector flexible, able to work with different languages and models.
> > > >
> > > > *Text enrichment*
> > > > Taking a look to the code I see in here a really strong assumption :
> > > >
> > > > String textContent = new String(bytes);
> > > >
> > > > This means you assume the only input possible is plain text.
> > > > Actually as we know we have the binary there, not necessary a plain
> > > > string.
> > > > I think we need to specify the Tika Transformer to be a requirement
> > > > for this connector.
> > > > Furthermore I would suggest the possibility for the user to select
> > > > the list of input fields to be considered to be the source of the
> > > > extraction.
> > > >
> > > > e.g.
> > > > I can configure my extraction to happen from title,text and
> > > > description.
> > > >
> > > > Of course it is required a Transformer Connector to happen before the
> > > > OpenNLP one, to provide those fields.
> > > > These are quick considerations after a first look to the code, happy
> > > > to discuss and help further :)
> > > >
> > > > Cheers
> > > >
> > > > On 18 November 2015 at 13:47, Karl Wright <[email protected]> wrote:
> > > >
> > > > > Thanks, Chalitha, for contributing this!
> > > > >
> > > > > I hope to have a look at the code also, but it won't happen until
> > > > > next week I'm afraid.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <[email protected]> wrote:
> > > > >
> > > > > > Hi Chalitha!
> > > > > >
> > > > > > Awesome!. I will take a look to this as soon as possible.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Rafa
> > > > > >
> > > > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > > I have worked on a OpenNLP based transformation connector for
> > > > > > > some requirement. Given a document it extracts named entities
> > > > > > > such as people, locations and organisations and add those as
> > > > > > > metadata to repository document.
> > > > > > > If you think this will be useful for the community, I would
> > > > > > > like to contribute it to manifoldcf.
> > > > > > > Connector code is available here [1].
> > > > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > > > Thanks,
> > > > > > > Chalitha
> > > > > > > --
> > > > > > > J.M Chalitha Udara Perera
> > > > > > > *Department of Computer Science and Engineering,*
> > > > > > > *University of Moratuwa,*
> > > > > > > *Sri Lanka*
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card : http://about.me/alessandro_benedetti
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> >
> > --
> > J.M Chalitha Udara Perera
> >
> > *Department of Computer Science and Engineering,*
> > *University of Moratuwa,*
> > *Sri Lanka*

--
J.M Chalitha Udara Perera
*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
