There's another problem with:
String textContent = new String(bytes);
Specifically, (1) its result depends on the default charset of the machine
it's being run on, which follows the machine's locale settings, and (2)
there's no limit to the amount of memory that this could conceivably
require. Both are problems. If you could use a
stream
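
A minimal sketch of the two concerns raised in this message, assuming the
document content arrives as an InputStream of UTF-8 bytes. The class name
BoundedTextReader, the method readText and the MAX_CHARS cap are
hypothetical names chosen for illustration; they are not part of
ManifoldCF or of the contributed connector. The charset is named
explicitly, and the text is read through a fixed-size buffer up to a cap
rather than materialized from one byte array of unknown size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BoundedTextReader {
    // Hypothetical cap, chosen only for illustration.
    public static final int MAX_CHARS = 1_000_000;

    // Decodes the stream as UTF-8 and returns at most maxChars chars.
    // Note: cutting at a char count can split a surrogate pair.
    public static String readText(InputStream in, int maxChars) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            while (sb.length() < maxChars) {
                int want = Math.min(buffer.length, maxChars - sb.length());
                int n = reader.read(buffer, 0, want);
                if (n < 0) {
                    break; // end of stream
                }
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9 M\u00fcnchen".getBytes(StandardCharsets.UTF_8);
        // Whole text, decoded the same way on every machine.
        System.out.println(readText(new ByteArrayInputStream(bytes), MAX_CHARS));
        // Only the first four characters are kept.
        System.out.println(readText(new ByteArrayInputStream(bytes), 4));
    }
}
```

Whether truncating at a cap is acceptable, or whether an over-long
document should be rejected instead, is a design choice this sketch does
not settle.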
Hi Chalitha!
Awesome! I will take a look at this as soon as possible.
Cheers,
Rafa
On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
wrote:
> Hi All,
> I have worked on a OpenNLP based transformation connector for some
> requirement. Given a document it
Hi All,
I have worked on an OpenNLP-based transformation connector for some
requirement. Given a document, it extracts named entities such as people,
locations and organisations and adds those as metadata to the repository
document.
If you think this will be useful for the community, I would like to
Hi Chalitha,
First, thank you so much for your work. I hope that some of us can take a
look at your project to understand whether it can fit into the trunk of
ManifoldCF.
I hope to take a look at it today. I think it is very interesting, but I
would like to receive other feedback from the PMC.
Thank
Thanks, Chalitha, for contributing this!
I hope to have a look at the code also, but it won't happen until next week
I'm afraid.
Karl
On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro wrote:
> Hi Chalitha!
>
> Awesome!. I will take a look to this as soon as possible.
>
Hey Chal,
First of all, thank you very much for the contribution!
I have some observations:
*Model Downloading*
Taking a look at the way you provide the user with the models, I can see
there is a shell script to download very specific English models.
It would be great to have the possibility
Hi guys,
Thank you very much for the comments and suggestions!
As Alessandro said, I have assumed the use of the Tika connector prior to
using the OpenNLP connector.
I think it is a valid assumption because Tika parses different sources
into a common format, so the future transformation connectors can
Hi Karl,
I will fix that encoding issue.
Thanks,
Chalitha
On Thu, Nov 19, 2015 at 12:31 PM, Karl Wright wrote:
> Hi Chalitha,
>
> My comment was about encoding, not about languages. If you are assuming
> that the binary document stream is utf-8 (which will be the output
Hi Chalitha,
My comment was about encoding, not about languages. If you are assuming
that the binary document stream is utf-8 (which will be the output of the
Tika transformer), then you *must* specify utf-8 as the encoding when you
convert it back to a string. Otherwise you will have data
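
To illustrate the encoding point, here is a small round-trip sketch. The
class name ExplicitCharsetDemo and the method decodeUtf8 are hypothetical,
and ISO-8859-1 merely stands in for whatever default charset a machine
might have. Bytes written as UTF-8 come back intact only when UTF-8 is
named on the decoding side as well.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Decodes bytes that are known to be UTF-8, naming the charset explicitly.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Z\u00fcrich";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives the round trip.
        String right = decodeUtf8(bytes);
        // A different charset on the decoding side: the two-byte sequence
        // for the u-umlaut is read as two separate characters.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints "true"
        System.out.println(wrong.equals(original)); // prints "false"
        System.out.println(wrong.length());         // prints "7"; the original has 6 chars
    }
}
```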