Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
Thanks a lot for the explanation, Karl, really useful.

I will wait for your reply at the end of the week, but I thought that was
the main reason for the "Reset seeding" option: to re-evaluate all pages,
as in a fresh execution.


On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:

> Okay, that is good to know.
> The hopcount assessment occurs when documents are added to the queue.
> Hopcounts are stored for each document in the hopcount table.  So if you
> change a hopcount limit, it is quite possible that nothing will change
> unless documents that are at the previous hopcount limit are re-evaluated.
> I believe there is no logic in ManifoldCF for that at this time, but I'd
> have to review the codebase to be certain of that.
>
> What that means is that you can't increase the hopcount limit and expect
> the next crawl to pick up the documents you excluded before with the
> hopcount mechanism.  Only when the documents need to be rescanned for some
> other reason would that happen as it stands now.  But I will get back to
> you after a review at the end of the week.
>
> Karl
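
(A quick way to check whether documents are pinned at the old hopcount
limit is to look at the hopcount table Karl mentions. Below is a minimal
JDBC sketch; the database URL, credentials, job id, and the column names
`jobid` and `distance` are assumptions about ManifoldCF's internal
PostgreSQL schema, so verify them against your own installation first.)

```java
// Minimal JDBC sketch: count queued documents per recorded hop distance
// for one job. Column names "jobid" and "distance" are assumed, not
// verified against the ManifoldCF schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HopcountCheck {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/dbname", "manifoldcf", "password")) {
      PreparedStatement ps = conn.prepareStatement(
          "SELECT distance, COUNT(*) FROM hopcount WHERE jobid = ? GROUP BY distance");
      ps.setLong(1, 1234567890L); // hypothetical job id
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println("distance " + rs.getLong(1) + ": "
              + rs.getLong(2) + " documents");
        }
      }
    }
  }
}
```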
>
>
> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> No, I haven't used that option; I have it configured as "Keep
>> unreachable documents, for now". But is it also ignoring them because
>> they were already kept? With that option, when are the "for now"
>> unreachable documents converted to "forever"?
>>
>> The only solution I can think of is to create a new job with the exact
>> same characteristics and run it.
>>
>> Regards and thanks
>>Marisol
>>
>>
>>
>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>
>>> If you ever set "Ignore unreachable documents forever" for the job, you
>>> can't go back and stop ignoring them.  The data that the job would need to
>>> have recorded for this is gone.  The only way to get it back is if you can
>>> convince ManifoldCF to recrawl all documents in the job.
>>>
>>>
>>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>>> marisol.redondo.gar...@gmail.com> wrote:
>>>
>>>>
>>>> Hi, I have a problem with documents being out of scope.
>>>>
>>>> I changed the maximum hop count for type "redirect" in one of my jobs
>>>> to 5, and saw that the job was not processing some pages because of
>>>> that, so I removed the value to get them injected into the output
>>>> connector (the Solr connector).
>>>> After that, the same pages are still out of scope, as if the limit
>>>> were set to 1, and they are not indexed.
>>>>
>>>> I have tried "Reset seeding", thinking that maybe the pages needed to
>>>> be checked again, but I still have the same problem. I don't think the
>>>> problem is with the output, but I have also used the options "Re-index
>>>> all associated documents" and "Remove all associated records", with
>>>> the same result.
>>>> I don't want to clear the history in the repository connection (it's a
>>>> website connector), as I don't want to lose all the history.
>>>>
>>>> Is this a bug in Manifold? Is there any option to fix this issue?
>>>>
>>>> I'm using Manifold version 2.24.
>>>>
>>>> Thanks
>>>> Marisol
>>>>
>>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
No, I haven't used that option; I have it configured as "Keep unreachable
documents, for now". But is it also ignoring them because they were already
kept? With that option, when are the "for now" unreachable documents
converted to "forever"?

The only solution I can think of is to create a new job with the exact same
characteristics and run it.

Regards and thanks
   Marisol



On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:

> If you ever set "Ignore unreachable documents forever" for the job, you
> can't go back and stop ignoring them.  The data that the job would need to
> have recorded for this is gone.  The only way to get it back is if you can
> convince ManifoldCF to recrawl all documents in the job.
>
>
> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>>
>> Hi, I have a problem with documents being out of scope.
>>
>> I changed the maximum hop count for type "redirect" in one of my jobs to
>> 5, and saw that the job was not processing some pages because of that,
>> so I removed the value to get them injected into the output connector
>> (the Solr connector).
>> After that, the same pages are still out of scope, as if the limit were
>> set to 1, and they are not indexed.
>>
>> I have tried "Reset seeding", thinking that maybe the pages needed to be
>> checked again, but I still have the same problem. I don't think the
>> problem is with the output, but I have also used the options "Re-index
>> all associated documents" and "Remove all associated records", with the
>> same result.
>> I don't want to clear the history in the repository connection (it's a
>> website connector), as I don't want to lose all the history.
>>
>> Is this a bug in Manifold? Is there any option to fix this issue?
>>
>> I'm using Manifold version 2.24.
>>
>> Thanks
>> Marisol
>>
>>


Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
Hi, I have a problem with documents being out of scope.

I changed the maximum hop count for type "redirect" in one of my jobs to 5,
and saw that the job was not processing some pages because of that, so I
removed the value to get them injected into the output connector (the Solr
connector).
After that, the same pages are still out of scope, as if the limit were set
to 1, and they are not indexed.

I have tried "Reset seeding", thinking that maybe the pages needed to be
checked again, but I still have the same problem. I don't think the problem
is with the output, but I have also used the options "Re-index all
associated documents" and "Remove all associated records", with the same
result.
I don't want to clear the history in the repository connection (it's a
website connector), as I don't want to lose all the history.

Is this a bug in Manifold? Is there any option to fix this issue?

I'm using Manifold version 2.24.

Thanks
Marisol


Re: Duplicate key value violates unique constraint "repohistory_pkey"

2023-06-16 Thread Marisol Redondo
Hi,

Did you find any solution for this, or do you still have the history
disabled?

I'm having the same problem, and we are using PostgreSQL as the database.

Regards

On Sun, 29 Jan 2023 at 05:48, Artem Abeleshev 
wrote:

> Hi everyone!
>
> We are using ManifoldCF 2.22.1 with multiple nodes in production, and I
> am investigating a problem we hit recently (it has happened at least 5-6
> times already). A couple of our jobs end up with the following error:
>
> ```
> Error: ERROR: duplicate key value violates unique constraint
> "repohistory_pkey" Detail: Key (id)=(1672652357009) already exists.
> ```
>
> and the following log entry appears in the logs of one of the nodes:
>
> ```
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: ERROR:
> duplicate key value violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:638)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:665)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:187)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:68)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.crawler.repository.RepositoryHistoryManager.addRow(RepositoryHistoryManager.java:202)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.repository.RepositoryConnectionManager.recordHistory(RepositoryConnectionManager.java:706)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordActivity(WorkerThread.java:1878)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocument(WebcrawlerConnector.java:1470)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:753)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:402)
> [mcf-pull-agent.jar:?]
> Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value
> violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
> at
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428)
> ~[postgresql-42.1.3.jar:42.1.3]
> at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:136)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.apache.manifoldcf.core.database.Database.execute(Database.java:916)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696)
> ~[mcf-core.jar:?]
> ```
>
> First, I noticed that the entity IDs in ManifoldCF are actually
> timestamps. So I became curious how it handles duplicates and started
> digging into the sources to get an idea of how the ids are generated. I
> found that ids are generated by the `IDFactory`
> (`org.apache.manifoldcf.core.interfaces.IDFactory`). `IDFactory` uses an
> id pool: each time a new id is needed, it is taken from the pool, and
> when the pool is empty `IDFactory` generates another 100 entries. To make
> sure ids do not overlap, the last generated id is stored in ZooKeeper, so
> each time `IDFactory` starts generating the next batch of ids, it starts
> from the last id generated. This part looks clean to me.
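
(A simplified sketch of the pooling scheme described above; this is not the
actual IDFactory source, and the AtomicLong below merely stands in for the
counter that ManifoldCF persists in ZooKeeper.)

```java
// Simplified sketch of a pooled, timestamp-based id factory. If two nodes
// ever refilled their pools from the same persisted value, they would hand
// out overlapping ranges -- the kind of failure the global lock prevents.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

public class PooledIdFactory {
  private static final int BATCH_SIZE = 100;
  // Stand-in for the last-generated id the real factory keeps in ZooKeeper.
  private final AtomicLong lastPersistedId = new AtomicLong(0L);
  private final Deque<Long> pool = new ArrayDeque<>();

  public synchronized long nextId() {
    if (pool.isEmpty()) {
      // Refill: start at max(now, last id + 1) so ids never repeat even if
      // more than BATCH_SIZE ids are consumed within one millisecond.
      long start = Math.max(System.currentTimeMillis(), lastPersistedId.get() + 1);
      for (long i = 0; i < BATCH_SIZE; i++) {
        pool.add(start + i);
      }
      lastPersistedId.set(start + BATCH_SIZE - 1);
    }
    return pool.poll();
  }
}
```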
>
> My next investigation concerned locking. Obviously, during id generation
> we need synchronization both at the thread level (the local JVM) and at
> the global level (ZooKeeper). Both the global and the local locking also
> look fine.
>
> The other observation I made is that all the cases happen while saving
> repository history records. So the next idea was that the same record was
> probably being stored repeatedly. But this part is quite hard to
> investigate, as a lot of service layers can call this.
>
> For now I have just disabled history completely by setting the
> `org.apache.manifoldcf.crawler.repository.store_history` property to
> `false`.
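
(For reference, that property would go in ManifoldCF's properties.xml; a
minimal entry would presumably look like the line below, though the exact
file location depends on your installation.)

```xml
<property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
```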

Solr connector authentication issue

2023-06-07 Thread Marisol Redondo
Hi,

We are using Solr 8 with basic authentication, and when checking the output
connection I'm getting the exception "Solr authorization failure, code 401:
aborting job".

The Solr type is SolrCloud, as we have 3 servers (installed in AWS
Kubernetes containers). I have set the user ID and password in the Server
tab, and I can connect to ZooKeeper and Solr: if I uncheck the option
"Block anonymous request", the connector works.

How can I make the connection work? I can't uncheck "Block anonymous
request".
Am I missing any other configuration?
Is there any other place where I have to set the user and password?

Thanks
Marisol
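
(One way to rule out the credentials themselves is to ping the collection
from a small SolrJ program using the same user and password as the output
connection. A minimal sketch follows; the host, collection, user, and
password are placeholders, not values from this thread.)

```java
// Minimal SolrJ sketch: ping a Solr collection with basic-auth credentials.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.SolrPing;

public class SolrAuthCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://solr-host:8983/solr/mycollection").build()) {
      SolrPing ping = new SolrPing();
      // Attach the same credentials configured in the MCF output connection.
      ping.setBasicAuthCredentials("solruser", "solrpassword");
      System.out.println("Ping status: " + ping.process(client).getStatus());
    }
  }
}
```

If this ping succeeds but the MCF connection check still returns 401, the
problem is more likely in how the connector sends the credentials than in
Solr's configuration.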


Unreachable documents not deleted from solr

2017-09-14 Thread Marisol Redondo
Hi.

I'm using ManifoldCF 2.x (2.5 in one VM and 2.6 in another) and crawling a
web site to index into Solr 6.

I was expecting that, by checking the "Delete unreachable documents"
checkbox in the job's "Hop Filters" tab, all the documents indexed in my
Solr instance that have been removed or moved would be deleted, because the
job can't reach them. But we have verified that the documents are still
there, and I haven't seen any document deletions in the repository history
(where I can see the injected documents).

Is this a bug in ManifoldCF? Should I change something else to remove all
the unreachable documents on the web site from the Solr index?

Thanks
Marisol


Re: UTF-8 Format from Confluence to Solr

2017-06-12 Thread Marisol Redondo
How can I do that?

On 1 June 2017 at 16:43, Antonio David Pérez Morales <
adperezmora...@gmail.com> wrote:

> Hi Marisol
>
> Would you mind creating a ticket and providing a patch?
>
> That way we can test it on our end and include it in the next Manifold
> release.
>
> Thanks
>
> Regards
>
> 2017-06-01 16:28 GMT+02:00 Marisol Redondo <marisol.redondo.garcia@gmail.com>:
>
>> I fixed the problem.
>>
>> The problem is that the Confluence connector reads the entity of the
>> request with the default encoding ("ISO-8859-1") rather than UTF-8.
>>
>> To fix that, I made a change in the Confluence connector: each time it
>> reads the request's entity, I use EntityUtils.toString(entity, "UTF-8").
>>
>> Thanks
>>
>>
>> On 31 May 2017 at 10:13, Marisol Redondo <marisol.redondo.garcia@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> I'm having problems with the encoding when injecting into Solr 6 in
>>> standalone mode from a Confluence wiki.
>>>
>>> I have Manifold 2.5 with Tomcat 8.
>>>
>>> The job's repository connector takes the information from a Confluence
>>> wiki, and the output connector is Solr, using the Tika transformation,
>>> a custom transformation and a Metadata adjuster.
>>>
>>> When the document is injected into Solr, its content has some
>>> characters that shouldn't be there, because they are not in the
>>> Confluence page, mainly a  character.
>>>
>>> I have checked Confluence and the Tomcat server where Manifold is
>>> running; the HTTP request to Confluence has the Accept-Charset header
>>> set to UTF-8, and the Solr server is accepting UTF-8.
>>>
>>> In the log, I can see that when retrieving the information from
>>> Confluence the content is fine, but when sending it to Solr it has the
>>> extra character. I have tried without using any transformer and get the
>>> same log entry.
>>>
>>> Is this a bug or how can I resolve this?
>>>
>>> Thanks for your help
>>>
>>>
>>>
>>>
>>>
>>
>


Re: UTF-8 Format from Confluence to Solr

2017-06-01 Thread Marisol Redondo
I fixed the problem.

The problem is that the Confluence connector reads the entity of the
request with the default encoding ("ISO-8859-1") rather than UTF-8.

To fix that, I made a change in the Confluence connector: each time it
reads the request's entity, I use EntityUtils.toString(entity, "UTF-8").
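
(A minimal sketch of the kind of change described, assuming the connector
reads HTTP responses via Apache HttpClient; the surrounding class and
method are hypothetical, and only the EntityUtils call reflects the actual
fix.)

```java
// Force UTF-8 when reading the HTTP response entity instead of relying on
// the default fallback (ISO-8859-1) used when no charset is declared.
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

public class ResponseReader {
  public static String readBody(HttpEntity entity) throws IOException {
    // Before: EntityUtils.toString(entity) -- falls back to ISO-8859-1
    // when the response declares no charset.
    return EntityUtils.toString(entity, "UTF-8");
  }
}
```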

Thanks


On 31 May 2017 at 10:13, Marisol Redondo <marisol.redondo.gar...@gmail.com>
wrote:

> Hi.
>
> I'm having problems with the encoding when injecting into Solr 6 in
> standalone mode from a Confluence wiki.
>
> I have Manifold 2.5 with Tomcat 8.
>
> The job's repository connector takes the information from a Confluence
> wiki, and the output connector is Solr, using the Tika transformation, a
> custom transformation and a Metadata adjuster.
>
> When the document is injected into Solr, its content has some characters
> that shouldn't be there, because they are not in the Confluence page,
> mainly a  character.
>
> I have checked Confluence and the Tomcat server where Manifold is
> running; the HTTP request to Confluence has the Accept-Charset header set
> to UTF-8, and the Solr server is accepting UTF-8.
>
> In the log, I can see that when retrieving the information from
> Confluence the content is fine, but when sending it to Solr it has the
> extra character. I have tried without using any transformer and get the
> same log entry.
>
> Is this a bug or how can I resolve this?
>
> Thanks for your help
>
>
>
>
>


UTF-8 Format from Confluence to Solr

2017-05-31 Thread Marisol Redondo
Hi.

I'm having problems with the encoding when injecting into Solr 6 in
standalone mode from a Confluence wiki.

I have Manifold 2.5 with Tomcat 8.

The job's repository connector takes the information from a Confluence
wiki, and the output connector is Solr, using the Tika transformation, a
custom transformation and a Metadata adjuster.

When the document is injected into Solr, its content has some characters
that shouldn't be there, because they are not in the Confluence page,
mainly a  character.

I have checked Confluence and the Tomcat server where Manifold is running;
the HTTP request to Confluence has the Accept-Charset header set to UTF-8,
and the Solr server is accepting UTF-8.

In the log, I can see that when retrieving the information from Confluence
the content is fine, but when sending it to Solr it has the extra
character. I have tried without using any transformer and get the same log
entry.

Is this a bug or how can I resolve this?

Thanks for your help


Re: Metadata adjuster

2017-02-22 Thread Marisol Redondo
Hi Karl, and thank you for the quick answer.

I was reading the documentation for MCF 1.10, but I'm actually using MCF
2.5, sorry for the confusion, and I think this version is compatible with
Solr 6.
The PDF doesn't have any metadata or field called facetContentType; that is
why I'd been trying to use the Metadata Adjuster, to add a new metadata
property to the doc so that Solr can index by this field when I inject the
doc.
Should I use another transformation, or is there any other way of doing it?
I am migrating from Nutch to ManifoldCF; in Nutch we can do this with
plugins, and I was thinking that Nutch plugins correspond to the
transformation connectors in MCF.

The complete error in Solr is:

2017-02-21 13:19:32.108 INFO  (qtp1854778591-18) [   x:sites]
o.a.s.c.PluginBag Going to create a new requestHandler with {type =
requestHandler,name = /update/extract,class =
solr.extraction.ExtractingRequestHandler,args =
{defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}

2017-02-21 13:19:32.454 INFO  (qtp1854778591-18) [   x:sites]
o.a.s.u.p.LogUpdateProcessorFactory [sites]  webapp=/solr
path=/update/extract
params={resource.name=introduction.pdf=https://./introduction.pdf=xml=2.2}{}
0 347

2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [   x:sites]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
[doc=https://../introduction.pdf] missing required field: facetContentType
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)


Thanks


On 21 February 2017 at 14:52, Karl Wright <daddy...@gmail.com> wrote:

> Hi Marisol,
>
> Can you find the [INFO] entry in the Solr log for this document?  That
> should help clear up any confusion.
>
> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up to
> date with Solr 6.x. That could be the source of the problem. Is there any
> reason you are using a 1.x version of MCF?
>
> Karl
>
>
> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> Hi.
>>
>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>> index, but it doesn't inject the value into a Solr field.
>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>> read in the documentation (https://manifoldcf.apache.org
>> /release/release-1.10/en_US/end-user-documentation.html) that I can add
>> metadata to the document that is going to be indexed into Solr, but the
>> Solr instance gives me the error "missing required field:
>> facetContentType".
>>
>> ManifoldCF Job pipeline:
>> 1. Repository (type web repository)
>> 2. Transformation (Tika Metadata Extractor)
>> 3. Transformation (type Metadata Adjuster)
>> 4. Output (Solr 6)
>>
>> ManifoldCF Job Metadata Expressions tab:
>>   Parameter name: "facetContentType"
>>   Remove this parameter: false
>>   Expression:   (the literal text value I want in facetContentType)
>>
>> Solr schema:
>>   .
>>   <field name="facetContentType" ... stored="true" required="true"/>
>>
>> The error logged in ManifoldCF is:
>>   Error from server at http://solrServer:port/solr/core:
>>   [doc=https://../index.aspx] missing required field: facetContentType.
>>
>> Thanks for your help
>>
>
>


Metadata adjuster

2017-02-21 Thread Marisol Redondo
Hi.

I'm trying to use the Metadata Adjuster to add one field to the Solr index,
but it doesn't inject the value into a Solr field.
Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
read in the documentation (https://manifoldcf.apache.org/release/release-1.10/en_
US/end-user-documentation.html) that I can add metadata to the document
that is going to be indexed into Solr, but the Solr instance gives me the
error "missing required field: facetContentType".

ManifoldCF Job pipeline:
1. Repository (type web repository)
2. Transformation (Tika Metadata Extractor)
3. Transformation (type Metadata Adjuster)
4. Output (Solr 6)

ManifoldCF Job Metadata Expressions tab:
  Parameter name: "facetContentType"
  Remove this parameter: false
  Expression:   (the literal text value I want in facetContentType)

Solr schema:
  .
  <field name="facetContentType" ... stored="true" required="true"/>

The error logged in ManifoldCF is:
  Error from server at http://solrServer:port/solr/core:
  [doc=https://../index.aspx] missing required field: facetContentType.

Thanks for your help
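
(One way to check whether the schema itself is fine is to index a test
document with facetContentType set explicitly from a small SolrJ program;
if that succeeds, the value is being dropped somewhere in the MCF pipeline
rather than rejected by Solr. A minimal sketch with placeholder host, core,
and field values follows.)

```java
// Minimal SolrJ sketch: send one test document with facetContentType set,
// to verify the schema accepts the field when it is supplied explicitly.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FacetFieldTest {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://solrServer:8983/solr/core").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "facet-content-type-test");
      doc.addField("facetContentType", "webpage"); // placeholder literal value
      client.add(doc);
      client.commit();
    }
  }
}
```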