Re: issue for indexing large blobs into databases

Karl Wright Mon, 06 May 2019 04:03:59 -0700

It sounds like there might be an out-of-memory situation, although if that
is the case I would expect that the MCF agents process would just shut
itself down.

If the process is hung, it should be easy to get a thread dump (for example,
by running jstack against the agents process PID).  That would be the first
step.

Some thoughts as to what might be happening: there might be an
out-of-memory condition that is being silently eaten somewhere.  MCF relies
on its connectors to use streams rather than load documents into memory,
BUT we're at the mercy of the JDBC drivers and Tika.  So another experiment
would be to try to crawl your documents without the Tika extractor (just
send them to the null output connector) and see if that succeeds.  If it
does, then maybe try the external Tika Service connector instead and see
what happens then.
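
To make the "streams rather than load documents into memory" point concrete,
here is a minimal sketch of what a stream-based read of a BLOB column looks
like on the client side. It is not MCF's actual connector code; the class and
method names (BlobStreaming, copyInChunks, streamBlobColumn) are hypothetical,
and the column name "data" is taken from the test table described below. Note
that calling getBinaryStream() only bounds memory on our side: whether the
JDBC driver itself buffers the whole value before handing it over is up to
the driver, which is the part we do not control.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BlobStreaming {

    // Copy in fixed-size chunks so that heap use on our side is bounded by
    // the buffer size, not by the size of the document.
    static long copyInChunks(InputStream in, OutputStream out, int bufferSize)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Hypothetical helper: reads one BLOB column from the current row as a
    // stream instead of calling getBytes(), which would allocate the whole
    // value as a single byte[].
    static long streamBlobColumn(ResultSet rs, String column, OutputStream out)
            throws SQLException, IOException {
        try (InputStream in = rs.getBinaryStream(column)) {
            if (in == null) {
                return 0; // column was SQL NULL
            }
            return copyInChunks(in, out, 64 * 1024);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a BLOB stream; a real run would pass a ResultSet
        // positioned on a row of the "data" table to streamBlobColumn().
        InputStream fakeBlob = new ByteArrayInputStream(new byte[1_000_000]);
        OutputStream discard = new OutputStream() {
            @Override
            public void write(int b) {
                // discard the byte; we only care about the count here
            }
        };
        System.out.println(copyInChunks(fakeBlob, discard, 8192));
    }
}
```

If a crawl of several 192 MB rows still exhausts a 1 GB heap with code shaped
like this, the buffering is most likely happening below this layer (driver)
or above it (extraction), which is what the null-output experiment above
would help separate.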

Karl

On Mon, May 6, 2019 at 5:56 AM Olivier Tavard <olivier.tav...@francelabs.com>
wrote:

> Hi MCF community,
>
> We have some issues using the JDBC connector to index a database that
> contains large LONGBLOB values, i.e. files larger than 100 MB.
> To reproduce the issue, I created a simple database on MySQL 5.7 with a
> single table containing 2 columns: id (int) and data (longblob).
> There are 8 records in the table: two records with very small files
> (less than 5 KB) and 6 records with a large file of 192 MB.
>
> If I include in my crawl only one big BLOB, the crawl is successful.
> seeding query : SELECT id AS $(IDCOLUMN) FROM data WHERE id =6
>
> But if I include more than 2 of these (with a WHERE IN condition, or if I
> select all the records of the table), the job stays in the running state
> but no document is processed:
> seeding query : SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7)
>
> There is no message in the logs (even with debug mode activated).
> The tests were done in a Docker cluster on dedicated servers (4 cores, 32
> GB RAM each).
> Regarding MCF, the version is 2.12. The MCF agent has 1 GB of RAM (xms and
> xmx), the internal database is PostgreSQL 10.1, and we use Zookeeper-based
> synchronization.
> We have a dump of the database here :
> https://www.datafari.com/files/mcftest.tar.gz
>
> In other cases, we also encounter this error:
> Error: Unexpected jobqueue status - record id 1555190422690, expecting
> active status, saw 0
>
> Any idea what the problem could be? Or at least why there is nothing in
> the debug log?
>
> Thank you,
> Best regards,
>
> Olivier
>
