Re: UTF-8 Format from Confluence to Solr

2017-06-01 Thread Antonio David Pérez Morales
Hi Marisol

Would you mind creating a ticket and providing a patch?

That way we can test it on our end and include it in the next Manifold
release.

Thanks

Regards

2017-06-01 16:28 GMT+02:00 Marisol Redondo:

> I fixed the problem.
>
> The problem is that the Confluence connector reads the response entity
> with the default encoding ("ISO-8859-1") instead of UTF-8.
>
> To fix that, I changed the Confluence connector so that every time it
> reads the response entity it uses EntityUtils.toString(entity, "UTF-8").
>
> Thanks
>
>
> On 31 May 2017 at 10:13, Marisol Redondo wrote:
>
>> Hi.
>>
>> I'm having problems with the encoding when injecting documents from a
>> Confluence wiki into Solr 6 in standalone mode.
>>
>> I have Manifold 2.5 with Tomcat-8.
>>
>> The job's repository connector takes the information from a
>> Confluence wiki and the output connector is Solr, using the Tika
>> transformation, a custom transformation and a Metadata adjuster.
>>
>> When the document is injected into Solr, its content has some
>> characters that shouldn't be there because they are not in the
>> Confluence page, mainly a  character.
>>
>> I have checked Confluence and the Tomcat server where Manifold is
>> running: the HTTP request to Confluence has the Accept-Charset header set
>> to UTF-8, and the Solr server is accepting UTF-8.
>>
>> In the log, I have seen that when retrieving the information from
>> Confluence, the content is fine, and when it is sent to Solr, it has
>> the character. I have tried without any transformer and I get the
>> same log entry.
>>
>> Is this a bug or how can I resolve this?
>>
>> Thanks for your help
>>
>>
>>
>>
>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-01 Thread Marisol Redondo
I fixed the problem.

The problem is that the Confluence connector reads the response entity
with the default encoding ("ISO-8859-1") instead of UTF-8.

To fix that, I changed the Confluence connector so that every time it
reads the response entity it uses EntityUtils.toString(entity, "UTF-8").
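
To illustrate the effect, here is a minimal sketch using only the JDK
(no HttpClient, no ManifoldCF code). The class name EncodingDemo, the
decode helper and the sample text are made up for this example; they are
not taken from the connector. It assumes the body arrives as UTF-8 bytes
and shows how decoding those bytes as ISO-8859-1 adds a stray character,
while decoding with an explicit UTF-8 charset does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Decode raw body bytes with an explicit charset instead of
    // relying on a library default (hypothetical helper).
    public static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Sample content (an assumption): "a", a non-breaking space
        // (U+00A0), "b". In UTF-8 this is 4 bytes: 61 C2 A0 62.
        byte[] body = "a\u00a0b".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each byte becomes one character, so byte C2
        // turns into an extra U+00C2 character in front of the space.
        String wrong = decode(body, StandardCharsets.ISO_8859_1);

        // Right charset: the two bytes C2 A0 become one character.
        String right = decode(body, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 length: " + wrong.length());
        System.out.println("UTF-8 length:      " + right.length());
    }
}
```

Passing the charset explicitly, as in the EntityUtils.toString call
above, avoids depending on whatever default applies when the response
does not declare one.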

Thanks


On 31 May 2017 at 10:13, Marisol Redondo wrote:

> Hi.
>
> I'm having problems with the encoding when injecting documents from a
> Confluence wiki into Solr 6 in standalone mode.
>
> I have Manifold 2.5 with Tomcat-8.
>
> The job's repository connector takes the information from a
> Confluence wiki and the output connector is Solr, using the Tika
> transformation, a custom transformation and a Metadata adjuster.
>
> When the document is injected into Solr, its content has some
> characters that shouldn't be there because they are not in the
> Confluence page, mainly a  character.
>
> I have checked Confluence and the Tomcat server where Manifold is
> running: the HTTP request to Confluence has the Accept-Charset header set
> to UTF-8, and the Solr server is accepting UTF-8.
>
> In the log, I have seen that when retrieving the information from
> Confluence, the content is fine, and when it is sent to Solr, it has
> the character. I have tried without any transformer and I get the
> same log entry.
>
> Is this a bug or how can I resolve this?
>
> Thanks for your help
>
>
>
>
>