Hi guys,

Thank you very much for the comments and suggestions!

As Alessandro said, I assumed that the Tika connector runs before the
OpenNLP connector. I think that is a valid assumption: Tika parses
different sources into a common format, so later transformation
connectors in the chain benefit from having it first.

Regarding the language issue: the current implementation only works with
English content. I agree with Alessandro that the connector can be made
to support other languages. The models OpenNLP currently provides are
listed at [1]. NER models are available for en, es and nl, and it is
possible to train models for other languages as well.

Tika can be used to detect the language of a document (as far as I know,
Stanbol does that). If we assume the Tika connector runs before the
OpenNLP connector, we can use the detected language to select the correct
model. In that case we have to download all the NER models and reference
them in the code.

Please give your suggestions on how language support should be included
in the OpenNLP connector.

Thanks,
Chalitha

[1] http://opennlp.sourceforge.net/models-1.5/

On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <[email protected]> wrote:

> There's another problem with:
>
> String textContent = new String(bytes);
>
> Specifically, (1) its operation will vary with the locale of the machine
> it's being run on, and (2) there's no limit to the amount of memory that
> this could conceivably require. Both are problems. If you could use a
> stream you would be much better off.
>
> Karl
>
>
> On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti
> <[email protected]> wrote:
>
> > Hey Chal,
> > First of all thanks you very much for the contribution!
> > I have some observations :
> >
> > *Model Downloading*
> >
> > Taking the look to the way you provide the user with the models, I can see
> > there is a shell script to download very specific english models.
> > It would be great having the possibility to configure the model to use in
> > the connector config UI .
> > In particular I see two possibilities :
> > 1) you provide a select list per model required and then automatically you
> > download the model and install it
> > 2) you provide the user with the possibility of uploading the model he/she
> > wants to use ( more flexible, but the user will need to download a model on
> > his own)
> > In my opinion is really important to keep the transformation connector
> > flexible, able to work with different languages and models.
> >
> > *Text enrichment*
> > Taking a look to the code I see in here a really strong assumption :
> >
> > String textContent = new String(bytes);
> >
> > This means you assume the only input possible is plain text.
> > Actually as we know we have the binary there, not necessary a plain string.
> > I think we need to specify the Tika Transformer to be a requirement for
> > this connector.
> > Furthermore I would suggest the possibility for the user to select the list
> > of input fields to be considered to be the source of the extraction.
> >
> > e.g.
> > I can configure my extraction to happen from title,text and description.
> >
> > Of course it is required a Transformer Connector to happen before the
> > OpenNLP one, to provide those fields.
> > These are quick considerations after a first look to the code, happy to
> > discuss and help further :)
> >
> > Cheers
> >
> > On 18 November 2015 at 13:47, Karl Wright <[email protected]> wrote:
> >
> > > Thanks, Chalitha, for contributing this!
> > >
> > > I hope to have a look at the code also, but it won't happen until next week
> > > I'm afraid.
> > >
> > > Karl
> > >
> > >
> > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <[email protected]> wrote:
> > >
> > > > Hi Chalitha!
> > > >
> > > > Awesome!. I will take a look to this as soon as possible.
> > > >
> > > > Cheers,
> > > >
> > > > Rafa
> > > >
> > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > > I have worked on a OpenNLP based transformation connector for some
> > > > > requirement. Given a document it extracts named entities such as people,
> > > > > locations and organisations and add those as metadata to repository
> > > > > document.
> > > > > If you think this will be useful for the community, I would like to
> > > > > contribute it to manifoldcf.
> > > > > Connector code is available here [1].
> > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > Thanks,
> > > > > Chalitha
> > > > > --
> > > > > J.M Chalitha Udara Perera
> > > > > *Department of Computer Science and Engineering,*
> > > > > *University of Moratuwa,*
> > > > > *Sri Lanka*
> > > > >
> > > >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>

--
J.M Chalitha Udara Perera
*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
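
P.S. To make the language question concrete, here is a minimal sketch of
how a detected language code could be mapped to an NER model. It is only
an illustration under assumptions: the class NerModelSelector and its
method are hypothetical and not part of the connector, the file names are
assumed from the naming pattern on the OpenNLP 1.5 models page (please
check them against the actual downloads), and the language code is
assumed to arrive as a plain string from whatever does the detection
upstream.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not part of the connector): picks an NER model
// file name for a detected language code.
final class NerModelSelector {

    // Assumed file names, following the naming pattern of the person-name
    // models on the OpenNLP 1.5 models page. Verify before relying on them.
    private static final Map<String, String> PERSON_MODELS = new HashMap<>();
    static {
        PERSON_MODELS.put("en", "en-ner-person.bin");
        PERSON_MODELS.put("es", "es-ner-person.bin");
        PERSON_MODELS.put("nl", "nl-ner-person.bin");
    }

    private NerModelSelector() {}

    // Returns the model file name for the language, or an empty Optional
    // when no model is registered for it (or no language was detected).
    static Optional<String> personModelFor(String languageCode) {
        if (languageCode == null) {
            return Optional.empty();
        }
        // Accept tags such as "en-US" or "en_GB" by keeping the primary subtag.
        String primary = languageCode.trim()
                .toLowerCase(Locale.ROOT)
                .split("[-_]", 2)[0];
        return Optional.ofNullable(PERSON_MODELS.get(primary));
    }
}
```

Returning an empty Optional for an unsupported language leaves the
decision (skip extraction, or fall back to English) to the connector
configuration instead of hard-coding it.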

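
On Karl's point about new String(bytes): below is a rough sketch of
reading the content from a stream with an explicit charset and a hard cap
on the number of characters kept in memory. BoundedTextReader and
readAtMost are hypothetical names, and which charset and limit to use are
assumptions the connector would still have to settle (for example through
its configuration).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper (not part of the connector): decodes a stream with
// an explicit charset and never buffers more than maxChars characters.
final class BoundedTextReader {

    private BoundedTextReader() {}

    // Reads at most maxChars characters from the stream. The charset is
    // passed explicitly so the result does not depend on the platform
    // default. The caller stays responsible for closing the stream.
    static String readAtMost(InputStream in, Charset charset, int maxChars)
            throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        while (text.length() < maxChars) {
            // Never ask for more than the remaining budget.
            int wanted = Math.min(buffer.length, maxChars - text.length());
            int read = reader.read(buffer, 0, wanted);
            if (read == -1) {
                break; // end of stream
            }
            text.append(buffer, 0, read);
        }
        return text.toString();
    }
}
```

Note that the cap counts Java chars, so it can split a surrogate pair at
the boundary, and content beyond the limit is simply not read.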