Hi - this is very helpful.  I would like you to officially create a ticket
in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach
these patches.  Backwards compatibility means that we very likely have to
use the hash approach, and not use the decoding approach.

Thanks,
Karl


On Mon, Mar 1, 2021 at 10:10 PM Shirai Takashi/ 白井隆 <shi...@nintendo.co.jp>
wrote:

> Hi, there.
>
> I've found another trouble in Elasticsearch connector.
> Elasticsearch output connector use the URI string as ID.
> Elasticsearch allows the length of ID no more than 512 bytes.
> If the URL length is too long, it causes HTTP 400 error.
>
> I prepare two solutions with this attached patch.
> The one is URI decoding.
> If the URI includes multibyte characters,
> the ID is URL encoded duplicately.
> Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
> This enlarges the ID length unnecessarily.
> Then I add the option to decode URI as the ID before encoding.
>
> But the length may still longer than 512 bytes.
> The other solution is hashing.
> The new added options are the following.
> Raw) uses the URI string as is.
> Hash) hashes (SHA1) the URI string always.
> Hash if long) hashes the URI only if its length exceeds 512 bytes.
> The last one is prepared for the compatibility.
>
> Both of solutions cause a new problem.
> If the URI is decoded or hashed,
> the original URI cannot be keeped in each document.
> Then I add the new fields.
> URI field name) keeps the original URI string as is.
> Decoded URI field name) keeps the decoded URI string.
> The default settings provides these fields as empty.
>
>
> I sended the patch for Ingest-Attachment the other day.
> Then this mail attaches the two patches.
> apache-manifoldcf-2.18-elastic-id.patch.gz:
>  The patch for 2.18 including the patch of the other day.
> apache-manifoldcf-elastic-id.patch.gz:
>  The patch for the source patched the other day.
>
> By the way, I tryed to describe the above to some documents.
> But no suitable document is found in the ManifoldCF package.
> The Elasticsearch document may be wrote for the ancient spacifications.
> Where can I describe this new specifications?
>
> ----
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp

Reply via email to