Hi Chalitha,

My comment was about encoding, not about languages. If you are assuming that the binary document stream is UTF-8 (which will be the output of the Tika transformer), then you *must* specify UTF-8 as the encoding when you convert it back to a string. Otherwise the bytes are decoded with the platform default charset, and you will have data corruption for any non-ASCII text on machines where that default is not UTF-8.
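A minimal sketch of what I mean, assuming the incoming bytes really are UTF-8 (that is, the Tika transformer ran earlier in the chain). The class name `Utf8Decode`, the method names, and the `maxChars` limit are made up for illustration and are not taken from the connector code. The second method also shows one way to read from a stream with a bound on memory, which relates to the earlier point about `new String(bytes)`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf8Decode {

    // Decode with an explicit charset instead of the platform default.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Streaming variant: decodes at most maxChars chars, so memory use is bounded.
    // Note: the cut-off is counted in Java chars, so it can split a surrogate pair.
    static String decodeBounded(InputStream in, int maxChars) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while (sb.length() < maxChars
                && (n = reader.read(buf, 0, Math.min(buf.length, maxChars - sb.length()))) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Non-ASCII sample text, written with escapes to avoid source-encoding issues.
        byte[] bytes = "Z\u00fcrich caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(bytes));
        System.out.println(decodeBounded(new ByteArrayInputStream(bytes), 6));
    }
}
```

Whether a hard character limit is acceptable for entity extraction is a design decision for the connector; the point here is only that the charset is named explicitly and the read is bounded.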
Thanks,
Karl

On Thu, Nov 19, 2015 at 12:45 AM, chalitha udara Perera <chalithaud...@gmail.com> wrote:

> Hi guys,
>
> Thank you very much for the comments and suggestions!
>
> As Alessandro said, I have assumed the use of the Tika connector prior to
> the OpenNLP connector.
> I think it is a valid assumption because Tika parses different sources
> into a common format, so future transformation connectors can largely
> benefit from the use of Tika in the connector chain.
>
> Regarding the language issue, currently I implemented it to work with
> English language content.
> But I agree with Alessandro that the connector can be made to support
> different languages. Currently OpenNLP has the following models [1]. NER
> models are available for the en, es and nl languages, but it is possible
> to train models for other languages as well.
>
> Tika can be used to detect the language of a document (as far as I know
> Stanbol does that). If we assume the use of the Tika connector before the
> OpenNLP connector, we can use the detected language to select the correct
> model. In this case we have to download all the NER models and reference
> them in code.
> Please give your suggestions on how language support should be included in
> the OpenNLP connector.
>
> Thanks,
> Chalitha
>
> [1] http://opennlp.sourceforge.net/models-1.5/
>
> On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <daddy...@gmail.com> wrote:
>
> > There's another problem with:
> >
> > String textContent = new String(bytes);
> >
> > Specifically, (1) its operation will vary with the locale of the machine
> > it's being run on, and (2) there's no limit to the amount of memory that
> > this could conceivably require. Both are problems. If you could use a
> > stream you would be much better off.
> >
> > Karl
> >
> >
> > On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti
> > <abenede...@apache.org> wrote:
> >
> > > Hey Chal,
> > > First of all thank you very much for the contribution!
> > > I have some observations:
> > >
> > > *Model Downloading*
> > >
> > > Taking a look at the way you provide the user with the models, I can see
> > > there is a shell script to download very specific English models.
> > > It would be great to have the possibility to configure the model to use
> > > in the connector config UI.
> > > In particular I see two possibilities:
> > > 1) you provide a select list per required model and then automatically
> > > download the model and install it
> > > 2) you provide the user with the possibility of uploading the model
> > > he/she wants to use (more flexible, but the user will need to download
> > > a model on his own)
> > > In my opinion it is really important to keep the transformation connector
> > > flexible, able to work with different languages and models.
> > >
> > > *Text enrichment*
> > > Taking a look at the code I see a really strong assumption here:
> > >
> > > String textContent = new String(bytes);
> > >
> > > This means you assume the only input possible is plain text.
> > > Actually, as we know, we have the binary there, not necessarily a plain
> > > string.
> > > I think we need to specify the Tika Transformer to be a requirement for
> > > this connector.
> > > Furthermore I would suggest the possibility for the user to select the
> > > list of input fields to be considered as the source of the extraction.
> > >
> > > e.g.
> > > I can configure my extraction to happen from title, text and description.
> > >
> > > Of course a Transformer Connector is required to run before the
> > > OpenNLP one, to provide those fields.
> > > These are quick considerations after a first look at the code, happy to
> > > discuss and help further :)
> > >
> > > Cheers
> > >
> > >
> > > On 18 November 2015 at 13:47, Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > > Thanks, Chalitha, for contributing this!
> > > >
> > > > I hope to have a look at the code also, but it won't happen until
> > > > next week I'm afraid.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <rharoapa...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Chalitha!
> > > > >
> > > > > Awesome! I will take a look at this as soon as possible.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Rafa
> > > > >
> > > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > > <chalithaud...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > > I have worked on an OpenNLP based transformation connector for some
> > > > > > requirement. Given a document, it extracts named entities such as
> > > > > > people, locations and organisations and adds those as metadata to
> > > > > > the repository document.
> > > > > > If you think this will be useful for the community, I would like to
> > > > > > contribute it to manifoldcf.
> > > > > > Connector code is available here [1].
> > > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > > Thanks,
> > > > > > Chalitha
> > > > > > --
> > > > > > J.M Chalitha Udara Perera
> > > > > > *Department of Computer Science and Engineering,*
> > > > > > *University of Moratuwa,*
> > > > > > *Sri Lanka*
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
>
> --
> J.M Chalitha Udara Perera
>
> *Department of Computer Science and Engineering,*
> *University of Moratuwa,*
> *Sri Lanka*