This new fix seems to work. Ingestions and deletions are working, and the image file with huge metadata is indexed!

Julien


On 25/09/2018 13:59, Karl Wright wrote:
I've committed a hack to trunk.  It has been tested for Solr Cell
documents, deletions, and for tika-connector-extracted documents that don't
have a lot of metadata.  I'm asking Julien to test it with his specific
image that has lots of metadata to see if the pathway for that case works
properly.  If it does, I'll spin another RC.

Long term, since I'm a Lucene/Solr committer, I think I'm going to have to
take SolrJ under my wing if we expect it to work for ManifoldCF.  I don't
have a lot of time to do stuff like this anymore but clearly neither does
the Solr team.

Karl


On Tue, Sep 25, 2018 at 6:14 AM Karl Wright <[email protected]> wrote:

The back-and-forth is not going well.  Mr. Noble needs to be convinced
that metadata longer than 4096 characters is a valid use case for Solr.
In fact it seems the Solr folks have deliberately been trying to get rid
of support for multipart posts for a while, because they don't see the
need for them.  I'm still hoping to convince them otherwise, but I'm not
getting a positive feel.

I'm still trying to figure out whether multipart posts have any fundamental
conflict with their RequestWriter architecture.  If not, I can perhaps
override the RequestWriter implementation and add multipart support that
way.  But it's not going to be a quick process by any means.


On Mon, Sep 24, 2018 at 12:13 PM Karl Wright <[email protected]> wrote:

Hi Julien,

This has nothing to do with the new Tika.

It is not normal; it means that UpdateRequests are not being sent as
multipart form posts.  It's going to require work from the Solr team to fix
this problem, however, because everything I do to work around the issue
nonetheless seems to fail. :-(
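
To illustrate what a multipart form post changes, here is a minimal JDK-only sketch that places metadata fields in the request body instead of the URI. It is not SolrJ code and does not reflect how UpdateRequest or RequestWriter are implemented; the class name, method name, boundary string and field values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: builds a multipart/form-data body by hand.
class MultipartBodySketch {

    // Each field becomes one body part, so its size is not bound by URI limits.
    static String buildBody(String boundary, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n");
            sb.append("Content-Disposition: form-data; name=\"")
              .append(field.getKey()).append("\"\r\n");
            sb.append("\r\n");                      // blank line ends the part headers
            sb.append(field.getValue()).append("\r\n");
        }
        sb.append("--").append(boundary).append("--\r\n"); // closing delimiter
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("RedTRC", "example-value");      // placeholder value, not real data
        System.out.print(buildBody("sketch-boundary", fields));
    }
}
```

A request carrying such a body would declare `Content-Type: multipart/form-data; boundary=sketch-boundary`, and the request URI would stay short however large the metadata values are.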

I'm having a back-and-forth with Paul Noble right now.  I'll update
accordingly when I know more.

Karl


On Mon, Sep 24, 2018 at 11:33 AM Julien Massiera <
[email protected]> wrote:

After testing it, it is a +1 for me.

However, I found an interesting new issue that comes with the new Tika
version. I have a jpg file for which some metadata fields were not
extracted before, such as RedTRC, BlueTRC and GreenTRC, which each contain
approximately 2048 bytes of data. As the metadata are passed to Solr
through the URI, I get the following error: URI is too large >8192

Do we consider it a "normal issue", or is it worth checking the
metadata length before sending the ingest request?
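
The length check raised above could be sketched roughly as follows, assuming Java 11 or later. This is only an illustration, not ManifoldCF or SolrJ code: the class and method names, the base URL and the sample values are hypothetical, and the 8192 limit is simply taken from the error message quoted above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: estimates whether metadata fits in a request URI.
class MetadataUriCheck {

    // Assumed limit, taken from the "URI is too large >8192" error.
    static final int MAX_URI_LENGTH = 8192;

    // Length of the query string the metadata would occupy once URL-encoded.
    static int encodedLength(Map<String, String> metadata) {
        int length = 0;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (length > 0) {
                length++;                            // '&' between parameters
            }
            length += URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8).length();
            length++;                                // '=' between name and value
            length += URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8).length();
        }
        return length;
    }

    // True if base URI + '?' + encoded metadata stays within the assumed limit.
    static boolean fitsInUri(String baseUri, Map<String, String> metadata) {
        return baseUri.length() + 1 + encodedLength(metadata) <= MAX_URI_LENGTH;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String sample = "%".repeat(2048);            // placeholder; '%' encodes to "%25"
        metadata.put("RedTRC", sample);
        metadata.put("BlueTRC", sample);
        metadata.put("GreenTRC", sample);
        // Hypothetical endpoint, used only to have a base length to add to.
        String baseUri = "http://localhost:8983/solr/collection1/update/extract";
        System.out.println("Fits in URI: " + fitsInUri(baseUri, metadata));
    }
}
```

Note that URL-encoding can triple the size of non-alphanumeric data, so a check on the raw metadata length alone would underestimate the final URI length.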


On 24/09/2018 16:43, Karl Wright wrote:
Please vote on whether to release ManifoldCF 2.11, RC3.  This release
contains a number of fixes/improvements/additions, described in the
CHANGES.txt file.  In addition, it includes Tika 1.19, which has a number
of fixes for classpath issues specifically requested by ManifoldCF.

This completely fixes a SolrJ related problem with the Solr Connector
found in RC3.  All tests pass.

The release artifact can be found at:

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.11

There is also a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.11-RC3

Thanks again,
Karl Wright

--
Julien MASSIERA
Directeur développement produit
France Labs – Les experts du Search
Retrouvez-nous à l’Enterprise Search & Discovery Summit à Washington DC
www.francelabs.com



