Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
No, only the seed URLs get updated with that option.


On Tue, Sep 26, 2023 at 10:09 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> Thanks a lot for the explanation, Karl, really useful.
>
> I will wait for your reply at the end of the week, but I thought that the
> main reason for the option "Reset seeding" was exactly that: re-evaluating
> all pages, as in a fresh new execution.
>
>
> On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:
>
>> Okay, that is good to know.
>> The hopcount assessment occurs when documents are added to the queue.
>> Hopcounts are stored for each document in the hopcount table.  So if you
>> change a hopcount limit, it is quite possible that nothing will change
>> unless documents that are at the previous hopcount limit are re-evaluated.
>> I believe there is no logic in ManifoldCF for that at this time, but I'd
>> have to review the codebase to be certain of that.
>>
>> What that means is that you can't increase the hopcount limit and expect
>> the next crawl to pick up the documents you excluded before with the
>> hopcount mechanism.  Only when the documents need to be rescanned for some
>> other reason would that happen as it stands now.  But I will get back to
>> you after a review at the end of the week.
>>
>> Karl
>>
>>
>> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>> No, I haven't used that option; I have it configured as "Keep
>>> unreachable documents, for now". But is it also ignoring them because they
>>> were already kept? With this option, when are the documents that are
>>> unreachable "for now" converted to unreachable "forever"?
>>>
>>> The only solution I can think of is creating a new job with the exact
>>> same characteristics and running it.
>>>
>>> Regards and thanks
>>>Marisol
>>>
>>>
>>>
>>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>>
>>>> If you ever set "Ignore unreachable documents forever" for the job, you
>>>> can't go back and stop ignoring them.  The data that the job would need to
>>>> have recorded for this is gone.  The only way to get it back is if you can
>>>> convince ManifoldCF to recrawl all documents in the job.
>>>>
>>>>
>>>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>>>> marisol.redondo.gar...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi, I have a problem with documents out of scope.
>>>>>
>>>>> I changed the maximum hop count for type "redirect" in one of my jobs to
>>>>> 5, and saw that the job was not processing some pages because of that, so I
>>>>> removed the value to get them injected into the output connector (Solr
>>>>> connector).
>>>>> After that, the same pages are still out of scope, as if the limit had
>>>>> been set to 1, and they are not indexed.
>>>>>
>>>>> I have tried "Reset seeding", thinking that maybe the pages needed to
>>>>> be checked again, but I still have the same problem. I don't think the
>>>>> problem is with the output, but I have also used the options "Re-index all
>>>>> associated documents" and "Remove all associated records", with the same
>>>>> result.
>>>>> I don't want to clear the history in the repository (it's a
>>>>> website connector), as I don't want to lose all the history.
>>>>>
>>>>> Is this a bug in Manifold? Is there any option to fix this issue?
>>>>>
>>>>> I'm using Manifold version 2.24.
>>>>>
>>>>> Thanks
>>>>> Marisol
>>>>>
>>>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
Thanks a lot for the explanation, Karl, really useful.

I will wait for your reply at the end of the week, but I thought that the
main reason for the option "Reset seeding" was exactly that: re-evaluating
all pages, as in a fresh new execution.


On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:

> Okay, that is good to know.
> The hopcount assessment occurs when documents are added to the queue.
> Hopcounts are stored for each document in the hopcount table.  So if you
> change a hopcount limit, it is quite possible that nothing will change
> unless documents that are at the previous hopcount limit are re-evaluated.
> I believe there is no logic in ManifoldCF for that at this time, but I'd
> have to review the codebase to be certain of that.
>
> What that means is that you can't increase the hopcount limit and expect
> the next crawl to pick up the documents you excluded before with the
> hopcount mechanism.  Only when the documents need to be rescanned for some
> other reason would that happen as it stands now.  But I will get back to
> you after a review at the end of the week.
>
> Karl
>
>
> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> No, I haven't used that option; I have it configured as "Keep
>> unreachable documents, for now". But is it also ignoring them because they
>> were already kept? With this option, when are the documents that are
>> unreachable "for now" converted to unreachable "forever"?
>>
>> The only solution I can think of is creating a new job with the exact
>> same characteristics and running it.
>>
>> Regards and thanks
>>Marisol
>>
>>
>>
>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>
>>> If you ever set "Ignore unreachable documents forever" for the job, you
>>> can't go back and stop ignoring them.  The data that the job would need to
>>> have recorded for this is gone.  The only way to get it back is if you can
>>> convince ManifoldCF to recrawl all documents in the job.
>>>
>>>
>>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>>> marisol.redondo.gar...@gmail.com> wrote:
>>>
>>>>
>>>> Hi, I have a problem with documents out of scope.
>>>>
>>>> I changed the maximum hop count for type "redirect" in one of my jobs to
>>>> 5, and saw that the job was not processing some pages because of that, so I
>>>> removed the value to get them injected into the output connector (Solr
>>>> connector).
>>>> After that, the same pages are still out of scope, as if the limit had
>>>> been set to 1, and they are not indexed.
>>>>
>>>> I have tried "Reset seeding", thinking that maybe the pages needed to
>>>> be checked again, but I still have the same problem. I don't think the
>>>> problem is with the output, but I have also used the options "Re-index all
>>>> associated documents" and "Remove all associated records", with the same
>>>> result.
>>>> I don't want to clear the history in the repository (it's a
>>>> website connector), as I don't want to lose all the history.
>>>>
>>>> Is this a bug in Manifold? Is there any option to fix this issue?
>>>>
>>>> I'm using Manifold version 2.24.
>>>>
>>>> Thanks
>>>> Marisol
>>>>
>>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
Okay, that is good to know.
The hopcount assessment occurs when documents are added to the queue.
Hopcounts are stored for each document in the hopcount table.  So if you
change a hopcount limit, it is quite possible that nothing will change
unless documents that are at the previous hopcount limit are re-evaluated.
I believe there is no logic in ManifoldCF for that at this time, but I'd
have to review the codebase to be certain of that.

What that means is that you can't increase the hopcount limit and expect
the next crawl to pick up the documents you excluded before with the
hopcount mechanism.  Only when the documents need to be rescanned for some
other reason would that happen as it stands now.  But I will get back to
you after a review at the end of the week.

Karl
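
To see whether documents are indeed pinned at the old limit, the hopcount
table can be inspected directly. A minimal sketch, assuming a PostgreSQL
back end and that the hopcount table carries jobid, linktype and distance
columns (verify the names against your schema; the job id and credentials
are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: count documents per hop distance for one job, to see how many
// sit at the previous hopcount limit and would need re-evaluation.
public class HopcountCheck {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:postgresql://localhost:5432/dbname";  // placeholder
    try (Connection c = DriverManager.getConnection(url, "manifoldcf", "password");
         PreparedStatement ps = c.prepareStatement(
             "SELECT distance, COUNT(*) FROM hopcount " +
             "WHERE jobid = ? AND linktype = 'redirect' GROUP BY distance ORDER BY distance")) {
      ps.setLong(1, 1234567890123L);  // placeholder job id
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next())
          System.out.println("distance " + rs.getLong(1) + ": " + rs.getLong(2) + " documents");
      }
    }
  }
}
```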


On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> No, I haven't used that option; I have it configured as "Keep unreachable
> documents, for now". But is it also ignoring them because they were already
> kept? With this option, when are the documents that are unreachable "for now"
> converted to unreachable "forever"?
>
> The only solution I can think of is creating a new job with the exact same
> characteristics and running it.
>
> Regards and thanks
>Marisol
>
>
>
> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>
>> If you ever set "Ignore unreachable documents forever" for the job, you
>> can't go back and stop ignoring them.  The data that the job would need to
>> have recorded for this is gone.  The only way to get it back is if you can
>> convince ManifoldCF to recrawl all documents in the job.
>>
>>
>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>>
>>> Hi, I have a problem with documents out of scope.
>>>
>>> I changed the maximum hop count for type "redirect" in one of my jobs to
>>> 5, and saw that the job was not processing some pages because of that, so I
>>> removed the value to get them injected into the output connector (Solr
>>> connector).
>>> After that, the same pages are still out of scope, as if the limit had
>>> been set to 1, and they are not indexed.
>>>
>>> I have tried "Reset seeding", thinking that maybe the pages needed to be
>>> checked again, but I still have the same problem. I don't think the problem
>>> is with the output, but I have also used the options "Re-index all associated
>>> documents" and "Remove all associated records", with the same result.
>>> I don't want to clear the history in the repository (it's a website
>>> connector), as I don't want to lose all the history.
>>>
>>> Is this a bug in Manifold? Is there any option to fix this issue?
>>>
>>> I'm using Manifold version 2.24.
>>>
>>> Thanks
>>> Marisol
>>>
>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
No, I haven't used that option; I have it configured as "Keep unreachable
documents, for now". But is it also ignoring them because they were already
kept? With this option, when are the documents that are unreachable "for now"
converted to unreachable "forever"?

The only solution I can think of is creating a new job with the exact same
characteristics and running it.

Regards and thanks
   Marisol



On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:

> If you ever set "Ignore unreachable documents forever" for the job, you
> can't go back and stop ignoring them.  The data that the job would need to
> have recorded for this is gone.  The only way to get it back is if you can
> convince ManifoldCF to recrawl all documents in the job.
>
>
> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>>
>> Hi, I have a problem with documents out of scope.
>>
>> I changed the maximum hop count for type "redirect" in one of my jobs to 5,
>> and saw that the job was not processing some pages because of that, so I
>> removed the value to get them injected into the output connector (Solr
>> connector).
>> After that, the same pages are still out of scope, as if the limit had
>> been set to 1, and they are not indexed.
>>
>> I have tried "Reset seeding", thinking that maybe the pages needed to be
>> checked again, but I still have the same problem. I don't think the problem
>> is with the output, but I have also used the options "Re-index all associated
>> documents" and "Remove all associated records", with the same result.
>> I don't want to clear the history in the repository (it's a website
>> connector), as I don't want to lose all the history.
>>
>> Is this a bug in Manifold? Is there any option to fix this issue?
>>
>> I'm using Manifold version 2.24.
>>
>> Thanks
>> Marisol
>>
>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
If you ever set "Ignore unreachable documents forever" for the job, you
can't go back and stop ignoring them.  The data that the job would need to
have recorded for this is gone.  The only way to get it back is if you can
convince ManifoldCF to recrawl all documents in the job.


On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

>
> Hi, I have a problem with documents out of scope.
>
> I changed the maximum hop count for type "redirect" in one of my jobs to 5,
> and saw that the job was not processing some pages because of that, so I
> removed the value to get them injected into the output connector (Solr
> connector).
> After that, the same pages are still out of scope, as if the limit had
> been set to 1, and they are not indexed.
>
> I have tried "Reset seeding", thinking that maybe the pages needed to be
> checked again, but I still have the same problem. I don't think the problem
> is with the output, but I have also used the options "Re-index all associated
> documents" and "Remove all associated records", with the same result.
> I don't want to clear the history in the repository (it's a website
> connector), as I don't want to lose all the history.
>
> Is this a bug in Manifold? Is there any option to fix this issue?
>
> I'm using Manifold version 2.24.
>
> Thanks
> Marisol
>
>


Re: web crawler https

2023-09-25 Thread Karl Wright
See this article:

https://stackoverflow.com/questions/6784463/error-trustanchors-parameter-must-be-non-empty

ManifoldCF web crawler configuration allows you to drop certs into a local
trust store for the connection.  You need to either do that (adding
whatever certificate authority cert you think might be missing), or check
the "trust https" checkbox.

You can generally debug what certs a site might need by trying to fetch a
page with curl and using verbose debug mode.

Karl
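
The "trustAnchors parameter must be non-empty" error usually means the trust
store the JVM ends up using is empty or unreadable. A minimal sketch to list
what a trust store actually contains (the cacerts path and the default
"changeit" password are assumptions; point it at whatever store your
connection is configured to use):

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import java.util.Collections;

// Sketch: open a trust store and list its entries. An empty or unreadable
// store is what produces the "trustAnchors parameter must be non-empty" error.
public class TrustStoreCheck {
  public static void main(String[] args) throws Exception {
    String path = System.getProperty("java.home") + "/lib/security/cacerts";
    char[] password = "changeit".toCharArray();  // default cacerts password
    KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
    try (FileInputStream in = new FileInputStream(path)) {
      ks.load(in, password);
    }
    System.out.println("Trust store entries: " + ks.size());
    for (String alias : Collections.list(ks.aliases()))
      System.out.println("  " + alias);
  }
}
```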


On Mon, Sep 25, 2023 at 10:48 AM Bisonti Mario 
wrote:

> Hi,
>
> I would like to try indexing a WordPress internal site.
>
> I tried to configure a Web repository connection and a job with seeds, but I always obtain:
>
>
>
> WARN 2023-09-25T16:31:50,905 (Worker thread '4') - Service interruption
> reported for job 1695649924581 connection 'Wp': IO exception
> (javax.net.ssl.SSLException)reading header: Unexpected error:
> java.security.InvalidAlgorithmParameterException: the trustAnchors
> parameter must be non-empty
>
>
>
> How could I solve this?
>
> Thanks a lot
>
> Mario
>
>


Re: Duplicate key value violates unique constraint "repohistory_pkey"

2023-06-16 Thread Marisol Redondo
Hi,

Did you find any solution for this, or do you still have the history
disabled?

I'm having the same problem, and we are using PostgreSQL as the DB.

Regards

On Sun, 29 Jan 2023 at 05:48, Artem Abeleshev 
wrote:

> Hi everyone!
>
> We are using ManifoldCF 2.22.1 with multiple nodes in our production. And
> I am investigating a problem we've hit recently (it has happened at least 5-6
> times already). A couple of our jobs end up with the following error:
>
> ```
> Error: ERROR: duplicate key value violates unique constraint
> "repohistory_pkey" Detail: Key (id)=(1672652357009) already exists.
> ```
>
> and following log entry appears in the logs of the one of the nodes:
>
> ```
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: ERROR:
> duplicate key value violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:638)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:665)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:187)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:68)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.crawler.repository.RepositoryHistoryManager.addRow(RepositoryHistoryManager.java:202)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.repository.RepositoryConnectionManager.recordHistory(RepositoryConnectionManager.java:706)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordActivity(WorkerThread.java:1878)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocument(WebcrawlerConnector.java:1470)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:753)
> ~[?:?]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:402)
> [mcf-pull-agent.jar:?]
> Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value
> violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
> at
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428)
> ~[postgresql-42.1.3.jar:42.1.3]
> at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:136)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.apache.manifoldcf.core.database.Database.execute(Database.java:916)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696)
> ~[mcf-core.jar:?]
> ```
>
> First, I noticed that the IDs of the entities in ManifoldCF are
> actually timestamps. So I became curious how it handles duplicates and
> started to dig into the sources to get an idea of how the ids are generated. I
> found that ids are generated by the `IDFactory`
> (`org.apache.manifoldcf.core.interfaces.IDFactory`). `IDFactory` uses
> an id pool. Each time we need a new id it is extracted from the
> pool. If the pool is empty, `IDFactory` generates another 100
> entries. To make sure ids do not overlap, the last generated id is
> stored in ZooKeeper, so each time `IDFactory` starts generating the
> next batch of ids, it starts from the last id generated. This part
> looks clean to me.
>
> The next investigation concerned locking. It is obvious that during id
> generation synchronization must be handled at the thread level (local JVM)
> and the global level (ZooKeeper). Both global and local locking also look fine.
>
> The other observation I made is that all cases happen while saving
> repository history records. So the next idea was that probably the same
> record was being stored repeatedly. But it seems quite hard to
> investigate this part, as a lot of service layers can call it.
>
> For now I have just disabled history completely by setting the
> `org.apache.manifoldcf.crawler.repository.store_history` property to
> `false` 
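
For reference, the pooled-id scheme described in the quoted message can be
sketched roughly as follows. This is purely illustrative, not the actual
IDFactory code; the ZooKeeper persistence and the cross-node locking are
reduced to comments:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only (not ManifoldCF's IDFactory): draw ids from a
// pool; when it is empty, generate another batch of 100 timestamp-based
// ids starting after the last persisted id, then persist the new maximum.
public class PooledIdFactory {
  private static final int BATCH = 100;
  private final Deque<Long> pool = new ArrayDeque<>();
  private long lastPersisted = 0L;  // in ManifoldCF this value lives in ZooKeeper

  public synchronized long next() {
    if (pool.isEmpty()) {
      // Global (cross-node) locking around this step is elided here;
      // without it, two nodes could hand out overlapping batches.
      long base = Math.max(lastPersisted + 1L, System.currentTimeMillis());
      for (int i = 0; i < BATCH; i++)
        pool.add(base + i);
      lastPersisted = base + BATCH - 1L;  // persist before handing out ids
    }
    return pool.poll();
  }
}
```

If the persisted maximum were ever read stale by one node, the same
timestamp ids could be issued twice, which could produce exactly the
duplicate-key symptom reported above.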

Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
But if those are set, and the connection health check passes, then I can't
tell you why Solr is unhappy with your connection.  It's clearly working
sometimes.  I'd look on the Solr end to figure out whether its rejection is
coming from just one of your instances.



On Wed, Jun 7, 2023 at 7:49 AM Karl Wright  wrote:

> The Solr output connection configuration contains all credentials that are
> sent to Solr.  If those aren't set, Solr won't get them.
>
> Karl
>
>
> On Wed, Jun 7, 2023 at 7:23 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> Hi,
>>
>> We are using Solr 8 with basic authentication, and when checking the
>> output connection I'm getting an exception "Solr authorization failure,
>> code 401: aborting job".
>>
>> The Solr type is SolrCloud, and we have 3 servers (installed in AWS
>> Kubernetes containers). I have set the user ID and password in the Server
>> tab and can connect to ZooKeeper and Solr, since, if I uncheck the option
>> "Block anonymous requests", the connector works.
>>
>> How can I make the connection work? I can't uncheck "Block
>> anonymous requests".
>> Am I missing any other configuration?
>> Is there any other place where I have to set the user and password?
>>
>> Thanks
>> Marisol
>>
>>


Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
The Solr output connection configuration contains all credentials that are
sent to Solr.  If those aren't set, Solr won't get them.

Karl


On Wed, Jun 7, 2023 at 7:23 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> Hi,
>
> We are using Solr 8 with basic authentication, and when checking the
> output connection I'm getting an exception "Solr authorization failure,
> code 401: aborting job".
>
> The Solr type is SolrCloud, and we have 3 servers (installed in AWS
> Kubernetes containers). I have set the user ID and password in the Server
> tab and can connect to ZooKeeper and Solr, since, if I uncheck the option
> "Block anonymous requests", the connector works.
>
> How can I make the connection work? I can't uncheck "Block
> anonymous requests".
> Am I missing any other configuration?
> Is there any other place where I have to set the user and password?
>
> Thanks
> Marisol
>
>


Re: Long Job on Windows Share

2023-05-25 Thread Karl Wright
The jcifs connector does not include a lot of information in the version
string for a file - basically, the length, and the modified date.  So I
would not expect there to be a lot of actual work involved if there are no
changes to a document.

The activity "access" does imply that the system believes that the document
does need to be reindexed.  It clearly reads the document properly.  I
would check to be sure it actually indexes the document.  I suspect that
your job may be reading the file but determining it is not suitable for
indexing and then repeating that every day.  You can see this by looking
for the document in the activity log to see what ManifoldCF decided to do
with it.

Karl
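
The incremental decision Karl describes comes down to comparing a version
string built from cheap metadata against the one stored from the last crawl.
A simplified sketch of the idea (illustrative only, not the actual jcifs
connector code):

```java
// Illustrative sketch of version-string incrementality: the connector
// builds a cheap fingerprint (length + modified date) and reindexing
// happens only when it differs from the stored fingerprint.
public class VersionCheck {
  static String versionString(long length, long lastModified) {
    return length + ":" + lastModified;
  }

  static boolean needsReindex(String storedVersion, long length, long lastModified) {
    return storedVersion == null
        || !storedVersion.equals(versionString(length, lastModified));
  }

  public static void main(String[] args) {
    String stored = versionString(1024L, 1684400000000L);
    // Unchanged file: no reindex expected.
    System.out.println(needsReindex(stored, 1024L, 1684400000000L));  // false
    // Modified file: reindex.
    System.out.println(needsReindex(stored, 2048L, 1684999999000L));  // true
  }
}
```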



On Thu, May 25, 2023 at 6:03 AM Bisonti Mario 
wrote:

> Hi,
>
> I would like to understand how recrawl works
>
>
>
> My job scan, using “Connection Type” “Windows shares”, runs for nearly 18
> hours.
>
> My document number is a little over 1 million.
>
>
>
> If I check the document scan from ManifoldCF I see, for example:
>
>
>
> It seems that it reworks the documents every day even if they hadn’t been
> modified.
>
> So, is this right, or did I choose the wrong job type to crawl the documents?
>
>
>
> Thanks a lot
>
> Mario
>
>
>
>
>


Re: Apache Manifold Documentum connector

2023-03-17 Thread Rasťa Šíša
Thanks a lot for your kind and elaborate response!
I will do some further investigation on my own on the Documentum side.
Best regards,
Rasta

On Fri, Mar 17, 2023 at 12:08 PM Karl Wright  wrote:

> It was open-sourced back in 2012 at the same time ManifoldCF was
> open-sourced.  It was written by a contractor paid by MetaCarta, who also
> paid for the development of ManifoldCF itself (I developed that).  It was
> spun off as open source when MetaCarta was bought by Nokia who had no
> interest in the framework or the connectors.
>
> I do not, off the top of my head, remember the contractor's name nor have
> his contact information any longer.
>
> There are many users of the Documentum Connector, however, and I would
> hope one of them with more DQL experience will respond.
>
> Karl
>
>
>
> On Fri, Mar 17, 2023 at 5:41 AM Rasťa Šíša  wrote:
>
>> Hi Karl, thanks for your answer! Would you be able to point me towards
>> the author/git branch of the documentum connector?
>> Best regards, Rasta
>>
>> On Thu, Mar 16, 2023 at 8:58 PM Karl Wright  wrote:
>>
>>> Hi,
>>>
>>> I didn't write the documentum connector initially, so I trust that the
>>> engineer who did knew how to construct the proper DQL.  I've not seen any
>>> bugs related to it so it does seem to work.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:
>>>
 Hello,
 I would like to ask how the Documentum ManifoldCF connector selects the
 latest version from the Documentum system.

 The first query that gets composed collects a list of i_chronicle_id values in
 DCTM.java. I would like to know, though, how ManifoldCF recognizes the
 latest version of the document (e.g. Effective status).
 From the UI, I am able to select some of the object types, but not
 object types with (all).

 In DQL it is just e.g.
 select i_chronicle_id from  
 instead of select i_chronicle_id from  (all)
 .

 This "(all)" object returns all of them. With the first type of query,
 though, the internal functioning of Documentum does not select the
 i_chronicle_id of documents that have a newly created version, e.g. the
 document is created, approved and effective, but someone has already created a
 new draft for it. With (all) in the DQL, it brings in all the documents
 and their r_object_id, among which we can select the effective version by
 status.
 Is this a bug in the ManifoldCF Documentum connector, that it does not allow
 you to select those (all) objects and so select those documents with new
 versions?
 Best regards,
 Rastislav Sisa

>>>


Re: Apache Manifold Documentum connector

2023-03-17 Thread Karl Wright
It was open-sourced back in 2012 at the same time ManifoldCF was
open-sourced.  It was written by a contractor paid by MetaCarta, who also
paid for the development of ManifoldCF itself (I developed that).  It was
spun off as open source when MetaCarta was bought by Nokia who had no
interest in the framework or the connectors.

I do not, off the top of my head, remember the contractor's name nor have
his contact information any longer.

There are many users of the Documentum Connector, however, and I would hope
one of them with more DQL experience will respond.

Karl



On Fri, Mar 17, 2023 at 5:41 AM Rasťa Šíša  wrote:

> Hi Karl, thanks for your answer! Would you be able to point me towards the
> author/git branch of the documentum connector?
> Best regards, Rasta
>
> On Thu, Mar 16, 2023 at 8:58 PM Karl Wright  wrote:
>
>> Hi,
>>
>> I didn't write the documentum connector initially, so I trust that the
>> engineer who did knew how to construct the proper DQL.  I've not seen any
>> bugs related to it so it does seem to work.
>>
>> Karl
>>
>>
>> On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:
>>
>>> Hello,
>>> I would like to ask how the Documentum ManifoldCF connector selects the
>>> latest version from the Documentum system.
>>>
>>> The first query that gets composed collects a list of i_chronicle_id values in
>>> DCTM.java. I would like to know, though, how ManifoldCF recognizes the
>>> latest version of the document (e.g. Effective status).
>>> From the UI, I am able to select some of the object types, but not
>>> object types with (all).
>>>
>>> In DQL it is just e.g.
>>> select i_chronicle_id from  
>>> instead of select i_chronicle_id from  (all)
>>> .
>>>
>>> This "(all)" object returns all of them. With the first type of query,
>>> though, the internal functioning of Documentum does not select the
>>> i_chronicle_id of documents that have a newly created version, e.g. the
>>> document is created, approved and effective, but someone has already created a
>>> new draft for it. With (all) in the DQL, it brings in all the documents
>>> and their r_object_id, among which we can select the effective version by
>>> status.
>>> Is this a bug in the ManifoldCF Documentum connector, that it does not allow
>>> you to select those (all) objects and so select those documents with new
>>> versions?
>>> Best regards,
>>> Rastislav Sisa
>>>
>>


Re: Apache Manifold Documentum connector

2023-03-17 Thread Rasťa Šíša
Hi Karl, thanks for your answer! Would you be able to point me towards the
author/git branch of the documentum connector?
Best regards, Rasta

On Thu, Mar 16, 2023 at 8:58 PM Karl Wright  wrote:

> Hi,
>
> I didn't write the documentum connector initially, so I trust that the
> engineer who did knew how to construct the proper DQL.  I've not seen any
> bugs related to it so it does seem to work.
>
> Karl
>
>
> On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:
>
>> Hello,
>> I would like to ask how the Documentum ManifoldCF connector selects the
>> latest version from the Documentum system.
>>
>> The first query that gets composed collects a list of i_chronicle_id values in
>> DCTM.java. I would like to know, though, how ManifoldCF recognizes the
>> latest version of the document (e.g. Effective status).
>> From the UI, I am able to select some of the object types, but not
>> object types with (all).
>>
>> In DQL it is just e.g.
>> select i_chronicle_id from  
>> instead of select i_chronicle_id from  (all)
>> .
>>
>> This "(all)" object returns all of them. With the first type of query,
>> though, the internal functioning of Documentum does not select the
>> i_chronicle_id of documents that have a newly created version, e.g. the
>> document is created, approved and effective, but someone has already created a
>> new draft for it. With (all) in the DQL, it brings in all the documents
>> and their r_object_id, among which we can select the effective version by
>> status.
>> Is this a bug in the ManifoldCF Documentum connector, that it does not allow
>> you to select those (all) objects and so select those documents with new
>> versions?
>> Best regards,
>> Rastislav Sisa
>>
>


Re: Apache Manifold Documentum connector

2023-03-16 Thread Karl Wright
Hi,

I didn't write the documentum connector initially, so I trust that the
engineer who did knew how to construct the proper DQL.  I've not seen any
bugs related to it so it does seem to work.

Karl


On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:

> Hello,
> I would like to ask how the Documentum ManifoldCF connector selects the
> latest version from the Documentum system.
>
> The first query that gets composed collects a list of i_chronicle_id values in
> DCTM.java. I would like to know, though, how ManifoldCF recognizes the
> latest version of the document (e.g. Effective status).
> From the UI, I am able to select some of the object types, but not
> object types with (all).
>
> In DQL it is just e.g.
> select i_chronicle_id from  
> instead of select i_chronicle_id from  (all)
> .
>
> This "(all)" object returns all of them. With the first type of query,
> though, the internal functioning of Documentum does not select the
> i_chronicle_id of documents that have a newly created version, e.g. the
> document is created, approved and effective, but someone has already created a
> new draft for it. With (all) in the DQL, it brings in all the documents
> and their r_object_id, among which we can select the effective version by
> status.
> Is this a bug in the ManifoldCF Documentum connector, that it does not allow
> you to select those (all) objects and so select those documents with new versions?
> Best regards,
> Rastislav Sisa
>


Re: Job stuck with cleaning up status

2023-02-03 Thread Karl Wright
The shutdown procedure for ManifoldCF involves sending interruptions (or
socket interruptions) to all worker threads.  These then put the threads in
the "terminated" state, one by one.  So you should only get this if you
shut down the agents process, or try to.  The handling for this is correct,
although sometimes embedded libraries do not handle thread shutdown
requests properly.

Anyhow, the cause of the problem is actually the fact that the output
connection cannot talk to the service, as stated.

Karl


On Fri, Feb 3, 2023 at 12:54 AM Artem Abeleshev <
artem.abeles...@rondhuit.com> wrote:

> Karl, good day!
>
> Thank you for the hint! It was very useful! Actually, you were right and
> the actual problem was about the connection. But I didn't expect it to
> be so dramatic. Here is what I found using some debugging:
>
> First I found the actual code that is responsible for the deletion
> of the documents. It is called by the `DocumentDeleteThread`
> (`org.apache.manifoldcf.crawler.system.DocumentDeleteThread`). Then I
> checked how many `DocumentDeleteThread` threads are supposed to be started. I
> haven't overridden the value, so I got the default 10 threads. Then I grabbed
> a thread dump and checked those threads. I found two strange things:
>
> 1. Not all threads were alive. Some of them had terminated.
> 2. Some live threads have a huge number of supplementary zk threads like
> `Worker thread 'x'-EventThread` and `Worker thread 'n'-EventThread(...)`.
> Even the threads that have already been terminated leave behind their
> supplementary threads (since they are daemon threads). As a result I have
> from 1000 to 2000 threads in total.
>
> I started to debug the live threads and came to the `deletePost`
> method of `HttpPoster`
> (`org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(String,
> IOutputRemoveActivity)`). Here I was always getting an exception:
>
> ```java
> org.apache.solr.client.solrj.SolrServerException: IOException occurred
> when talking to server at: http://10.78.11.71:8983/solr/jaweb
> org.apache.http.conn.ConnectTimeoutException: Connect to 10.78.11.71:8983
> [/10.78.11.71] failed: connect timed out
> ```
>
> The exception was due to Solr being unavailable (i.e. shut down), so no
> surprise here. But the following was a true surprise for me. The exception
> I got is of type `IOException`. Inside the `HttpPoster` that exception
> in the end is handled by the method `handleIOException`
> (org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(IOException,
> String)):
>
> ```java
>   protected static void handleIOException(IOException e, String context)
> throws ManifoldCFException, ServiceInterruption
>   {
> if ((e instanceof InterruptedIOException) && (!(e instanceof
> java.net.SocketTimeoutException)))
>   throw new ManifoldCFException(e.getMessage(),
> ManifoldCFException.INTERRUPTED);
> ...
>   }
> ```
>
> As we can see an exception is wrapped with the `ManifoldCFException`
> exception and assigned with the `INTERRUPTED` error code. Then this
> exception is bubbling up unitl it ends up in the main loop of the
> `DocumentDeleteThread`. Here is the full stack I extract during debug
> (unfortunately not a single exception is logged on the way):
>
> ```java
>
> org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:514),
>
> org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:427),
>
> org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(HttpPoster.java:817),
>
> org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:594),
>
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2296),
>
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1037),
>
> org.apache.manifoldcf.crawler.system.DocumentDeleteThread.run(DocumentDeleteThread.java:134)
> ```
>
> Inside the main loop of the `DocumentDeleteThread` that exception is
> handled like this:
>
> ```java
> public void run()
>   {
> try
> {
>   ...
>   // Loop
>   while (true)
>   {
> // Do another try/catch around everything in the loop
> try
> {
>   ...
> }
> catch (ManifoldCFException e)
> {
>   if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
> break;
> ...
> }
>   ...
>   }
> }
> catch (Throwable e)
> {
>   ...
> }
>   }
> ```
>
> It just breaks the loop, making the thread terminate normally! In quite a
> short time I always end up with no `DocumentDeleteThread`s at all and the
> framework transitions to an inconsistent state.
>
> In the end, I brought Solr back online and managed to finish the deletion
> successfully. But I think this case should be handled in some way.
>
> With respect,
> Abeleshev 

Re: Job stuck with cleaning up status

2023-02-02 Thread Artem Abeleshev
Karl, good day!

Thank you for the hint! It was very useful! Actually, you were right and the
actual problem was about the connection. But I didn't expect it to be
so dramatic. Here is what I found using some debugging:

First I found the actual code that is responsible for the deletion of
the documents. It is called by the `DocumentDeleteThread`
(`org.apache.manifoldcf.crawler.system.DocumentDeleteThread`). Then I
checked how many `DocumentDeleteThread` threads are supposed to be started. I
haven't overridden the value, so I got the default 10 threads. Then I grabbed
a thread dump and checked those threads. I found two strange things:

1. Not all threads were alive. Some of them had terminated.
2. Some live threads have a huge number of supplementary zk threads like
`Worker thread 'x'-EventThread` and `Worker thread 'n'-EventThread(...)`.
Even the threads that have already been terminated leave behind their
supplementary threads (since they are daemon threads). As a result I have
from 1000 to 2000 threads in total.

I started to debug the live threads and came to the `deletePost`
method of `HttpPoster`
(`org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(String,
IOutputRemoveActivity)`). Here I was always getting an exception:

```java
org.apache.solr.client.solrj.SolrServerException: IOException occurred when
talking to server at: http://10.78.11.71:8983/solr/jaweb
org.apache.http.conn.ConnectTimeoutException: Connect to 10.78.11.71:8983 [/
10.78.11.71] failed: connect timed out
```

The exception was due to Solr being unavailable (i.e. shut down), so no
surprise here. But the following was a true surprise for me. The exception
I got is of type `IOException`. Inside the `HttpPoster` that exception
in the end is handled by the method `handleIOException`
(org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(IOException,
String)):

```java
  protected static void handleIOException(IOException e, String context)
throws ManifoldCFException, ServiceInterruption
  {
if ((e instanceof InterruptedIOException) && (!(e instanceof
java.net.SocketTimeoutException)))
  throw new ManifoldCFException(e.getMessage(),
ManifoldCFException.INTERRUPTED);
...
  }
```

As we can see, the exception is wrapped in a `ManifoldCFException`
and assigned the `INTERRUPTED` error code. Then this
exception bubbles up until it ends up in the main loop of the
`DocumentDeleteThread`. Here is the full stack I extracted during debugging
(unfortunately not a single exception is logged on the way):

```java
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:514),
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:427),
org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(HttpPoster.java:817),
org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:594),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2296),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1037),
org.apache.manifoldcf.crawler.system.DocumentDeleteThread.run(DocumentDeleteThread.java:134)
```

Inside the main loop of the `DocumentDeleteThread` that exception is
handled like this:

```java
public void run()
  {
try
{
  ...
  // Loop
  while (true)
  {
// Do another try/catch around everything in the loop
try
{
  ...
}
catch (ManifoldCFException e)
{
  if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
break;
...
}
  ...
  }
}
catch (Throwable e)
{
  ...
}
  }
```

It just breaks the loop, making the thread terminate normally! In quite a
short time I always end up with no `DocumentDeleteThread`s at all and the
framework transitions to an inconsistent state.

In the end, I brought Solr back online and managed to finish the deletion
successfully. But I think this case should be handled in some way.

With respect,
Abeleshev Artem
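
One possible direction, sketched under the assumption that a connect timeout
should be treated as retryable rather than allowed to surface as an
INTERRUPTED error code: map it to a ServiceInterruption with a retry time
before the generic InterruptedIOException test runs. This is only a sketch
of the idea, not a tested patch against HttpPoster:

```java
// Sketch of the idea only, not a tested patch: map connect timeouts to a
// retryable ServiceInterruption instead of the INTERRUPTED error code that
// makes DocumentDeleteThread break out of its loop and terminate.
protected static void handleIOException(IOException e, String context)
  throws ManifoldCFException, ServiceInterruption
{
  if (e instanceof org.apache.http.conn.ConnectTimeoutException)
  {
    long retryTime = System.currentTimeMillis() + 300000L;  // retry in ~5 minutes
    throw new ServiceInterruption("Solr connect timeout: " + e.getMessage(), retryTime);
  }
  if ((e instanceof InterruptedIOException) && (!(e instanceof
    java.net.SocketTimeoutException)))
    throw new ManifoldCFException(e.getMessage(),
      ManifoldCFException.INTERRUPTED);
  // ... remaining handling as in HttpPoster ...
}
```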

On Sun, Jan 29, 2023 at 10:36 PM Karl Wright  wrote:

> Hi,
>
> 2.22 makes no changes to the way document deletions are processed over
> probably 10 previous versions of ManifoldCF.
>
> What likely is the case is that the connection to the output for the job
> you are cleaning up is down.  When that happens, the documents are queued
> but the delete worker threads cannot make any progress.
>
> You can see this maybe by looking at the "Simple Reports" for the job in
> question and see what it is doing and why the deletions are not succeeding.
>
> Karl
>
>
> On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
> artem.abeles...@rondhuit.com> wrote:
>
>> Hi, everyone!
>>
>> Another problem that I hit sometimes. We are using ManifoldCF 2.22.1 with
>> multiple nodes in our production. The creation of the MCF 

Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-02-01 Thread Karl Wright
It looks like you are running with a profiler?  That uses a lot of memory.
Karl
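
For the MarkStackSizeMax question in the quoted crash log below: that is a
JVM (G1 garbage collector) flag, not a ManifoldCF or agent setting. It goes
on the agents-process command line alongside the existing -Xms/-Xmx options,
for example (the 512m value is only an illustration; the log shows the
current value is 16777216 bytes):

```
-Xms32768m -Xmx32768m -XX:MarkStackSizeMax=512m \
  -Dorg.apache.manifoldcf.configfile=./properties.xml ... org.apache.manifoldcf.agents.AgentRun
```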


On Wed, Feb 1, 2023 at 8:06 AM Bisonti Mario 
wrote:

> This is my hs_err_pid_.log
>
>
>
> Command Line: -Xms32768m -Xmx32768m
> -Dorg.apache.manifoldcf.configfile=./properties.xml
> -Djava.security.auth.login.config= -Dorg.apache.manifoldcf.processid=A
> org.apache.manifoldcf.agents.AgentRun
>
>
>
> .
>
> .
>
> .
>
> CodeHeap 'non-profiled nmethods': size=120032Kb used=23677Kb
> max_used=23677Kb free=96354Kb
>
> CodeHeap 'profiled nmethods': size=120028Kb used=20405Kb max_used=27584Kb
> free=99622Kb
>
> CodeHeap 'non-nmethods': size=5700Kb used=1278Kb max_used=1417Kb
> free=4421Kb
>
> Memory: 4k page, physical 72057128k(7300332k free), swap 4039676k(4039676k
> free)
>
> .
>
> .
>
>
>
> Perhaps it could be a RAM problem?
>
>
>
> Thanks a lot
>
>
>
>
>
>
>
>
>
> *From:* Bisonti Mario
> *Sent:* Friday, January 20, 2023 10:28
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> I see that the agent crashed:
>
> #
>
> # A fatal error has been detected by the Java Runtime Environment:
>
> #
>
> #  Internal Error (g1ConcurrentMark.cpp:1665), pid=2537463, tid=2537470
>
> #  fatal error: Overflow during reference processing, can not continue.
> Please increase MarkStackSizeMax (current value: 16777216) and restart.
>
> #
>
> # JRE version: OpenJDK Runtime Environment (11.0.16+8) (build
> 11.0.16+8-post-Ubuntu-0ubuntu120.04)
>
> # Java VM: OpenJDK 64-Bit Server VM (11.0.16+8-post-Ubuntu-0ubuntu120.04,
> mixed mode, tiered, g1 gc, linux-amd64)
>
> # Core dump will be written. Default location: Core dumps may be processed
> with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E"
> (or dumping to
> /opt/manifoldcf/multiprocess-zk-example-proprietary/core.2537463)
>
> #
>
> # If you would like to submit a bug report, please visit:
>
> #   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
>
> #
>
>
>
> ---  S U M M A R Y 
>
>
>
> Command Line: -Xms32768m -Xmx32768m
> -Dorg.apache.manifoldcf.configfile=./properties.xml
> -Djava.security.auth.login.config= -Dorg.apache.manifoldcf.processid=A
> org.apache.manifoldcf.agents.AgentRun
>
>
>
> Host: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 8 cores, 68G, Ubuntu
> 20.04.4 LTS
>
> Time: Fri Jan 20 09:38:54 2023 CET elapsed time: 54532.106681 seconds (0d
> 15h 8m 52s)
>
>
>
> ---  T H R E A D  ---
>
>
>
> Current thread (0x7f051940a000):  VMThread "VM Thread" [stack:
> 0x7f051c50a000,0x7f051c60a000] [id=2537470]
>
>
>
> Stack: [0x7f051c50a000,0x7f051c60a000],  sp=0x7f051c608080,
> free space=1016k
>
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
>
> V  [libjvm.so+0xe963a9]
>
> V  [libjvm.so+0x67b504]
>
> V  [libjvm.so+0x7604e6]
>
>
>
>
>
> So, where could I change that parameter?
>
> Is it an Agent configuration?
>
> Thanks a lot
>
> Mario
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Wednesday, January 18, 2023 14:59
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> When you get a hang like this, getting a thread dump of the agents process
> is essential to figure out what the issue is.  You can't assume that a
> transient error would block anything because that's not how ManifoldCF
> works, at all.  Errors push the document in question back onto the queue
> with a retry time.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jan 18, 2023 at 6:15 AM Bisonti Mario 
> wrote:
>
> Hi Karl.
>
> But I noticed that the job was hanging; the documents-processed count was stuck
> at the same number, with no further document processing from 6 a.m. until I
> restarted the agent.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Wednesday, January 18, 2023 12:10
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> Hi, "Possibly transient issue" means that the error will be retried
> anyway, according to a schedule.  There should not need to be any
> requirement to shut down the agents proce

Re: Job stuck with cleaning up status

2023-01-29 Thread Karl Wright
Hi,

2.22 makes no changes to the way document deletions are processed over
probably 10 previous versions of ManifoldCF.

What likely is the case is that the connection to the output for the job
you are cleaning up is down.  When that happens, the documents are queued
but the delete worker threads cannot make any progress.

You can see this maybe by looking at the "Simple Reports" for the job in
question and see what it is doing and why the deletions are not succeeding.

Karl
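
Besides the Simple Reports, the queue itself can be summarized directly in
the database; the jobqueue table and the 'E'/'D' status codes are described
in the quoted message below. A minimal sketch (connection details are
placeholders):

```java
import java.sql.*;

// Sketch: group one job's queue entries by status to see whether documents
// are stuck in 'E' (eligible for delete) or 'D' (being deleted).
public class JobQueueStatus {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:postgresql://localhost:5432/dbname";  // placeholder
    try (Connection c = DriverManager.getConnection(url, "manifoldcf", "password");
         PreparedStatement ps = c.prepareStatement(
             "SELECT status, COUNT(*) FROM jobqueue WHERE jobid = ? GROUP BY status")) {
      ps.setLong(1, 1658215015582L);  // job id from the quoted message
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next())
          System.out.println(rs.getString(1) + ": " + rs.getLong(2));
      }
    }
  }
}
```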


On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
artem.abeles...@rondhuit.com> wrote:

> Hi, everyone!
>
> Another problem that I hit sometimes. We are using ManifoldCF 2.22.1 with
> multiple nodes in our production. The creation of the MCF job pipeline is
> handled via API calls from our service. We create jobs, repositories
> and output repositories. The crawler extracts documents and then they are
> pushed to Solr. The pipeline works OK.
>
> The problem is about deleting the job. Sometimes the job gets stuck with
> a `Cleaning up` status (in the DB it has status `e`, which corresponds to
> `STATUS_DELETING`). This time I used the MCF Web Admin to delete the job
> (pressed the delete button on the job list page).
>
> I have checked the sources and debugged it a bit. The method
> `deleteJobsReadyForDelete()`
> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
> works OK. It is unable to delete the job because it still finds some
> documents in the document queue table. The following SQL is executed
> within this method:
>
> ```sql
> select id from jobqueue where jobid = '1658215015582' and (status = 'E' or
> status = 'D') limit 1;
> ```
>
> where `E` status stands for `STATUS_ELIGIBLEFORDELETE` and `D` status
> stands for `STATUS_BEINGDELETED`. If at least one such document is
> found in the queue it will do nothing. At the moment I had a lot of
> documents residing within the `jobqueue` with the indicated statuses (actually
> all of them have `D` status).
>
> I see that the `Documents delete stuffer thread` is running, and it sets status
> `STATUS_BEINGDELETED` on the documents via the
> `getNextDeletableDocuments()` method
> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
> int, long)`). But I can't find any logic that actually deletes the
> documents. I've searched through the sources, but status
> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
> Searching in reverse order from `JobQueue`
> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also gives me no result.
> I would appreciate it if someone could point out where to look, so I can
> debug and check what conditions are preventing the documents from being removed.
>
> Thank you!
>
> With respect,
> Artem Abeleshev
>


Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
When you get a hang like this, getting a thread dump of the agents process
is essential to figure out what the issue is.  You can't assume that a
transient error would block anything because that's not how ManifoldCF
works, at all.  Errors push the document in question back onto the queue
with a retry time.

Karl
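
The requeue-with-retry behavior described above is what a connector gets by
throwing ServiceInterruption; a minimal sketch of the pattern (the message
and the five-minute retry interval are illustrative):

```java
import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;

// Sketch of the retry pattern: a connector that hits a possibly transient
// error rethrows it as a ServiceInterruption carrying a retry time, and the
// framework puts the document back on the queue instead of aborting the job.
public class TransientRetry {
  static void handlePossiblyTransient(Exception e) throws ServiceInterruption {
    long currentTime = System.currentTimeMillis();
    throw new ServiceInterruption("Transient JCIFS error: " + e.getMessage(),
        currentTime + 5L * 60L * 1000L);  // ask for a retry in ~5 minutes
  }
}
```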


On Wed, Jan 18, 2023 at 6:15 AM Bisonti Mario 
wrote:

> Hi Karl.
>
> But I noticed that the job was hanging; the documents-processed count was stuck
> at the same number, with no further document processing from 6 a.m. until I
> restarted the agent.
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* mercoledì 18 gennaio 2023 12:10
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> Hi, "Possibly transient issue" means that the error will be retried
> anyway, according to a schedule.  There should be no need to shut down
> the agents process and restart.
>
> Karl
>
>
>
> On Wed, Jan 18, 2023 at 5:08 AM Bisonti Mario 
> wrote:
>
> Hi.
>
> Often, I obtain the error:
>
> WARN 2023-01-18T06:18:19,316 (Worker thread '89') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.openUnshared(SmbFile.java:665)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:169)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2468)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1243)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:?]
>
>
>
> So, I have to stop the agent, restart it, and the crawling continues.
>
>
>
> How could I solve my issue?
> Thanks a lot.
>
> Mario
>
>


Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
Hi, "Possibly transient issue" means that the error will be retried anyway,
according to a schedule.  There should be no need to shut down the agents
process and restart.
Karl

On Wed, Jan 18, 2023 at 5:08 AM Bisonti Mario 
wrote:

> Hi.
>
> Often, I obtain the error:
>
> WARN 2023-01-18T06:18:19,316 (Worker thread '89') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.openUnshared(SmbFile.java:665)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:169)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2468)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1243)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:?]
>
>
>
> So, I have to stop the agent, restart it, and the crawling continues.
>
>
>
> How could I solve my issue?
> Thanks a lot.
>
> Mario
>


Re: Help for subscribing the user mailing list of MCF

2023-01-10 Thread Koji Sekiguchi
Hi Karl,

I agree. BTW, Artem, my colleague, finally succeeded in subscribing. He
tried to subscribe a few more times before opening a JIRA ticket with
INFRA, and he finally got some responses from the ML system. Maybe
they restarted the system or did something else.

Thanks!

Koji

On Tue, Jan 10, 2023 at 20:17 Karl Wright  wrote:
>
> Hmm - I haven't heard of difficulties like this before.  The mail manager is 
> used apache-wide; if it doesn't work the best thing to do would be to create 
> an infra ticket in JIRA.
>
> Karl
>
>
> On Tue, Jan 10, 2023 at 3:50 AM Koji Sekiguchi  
> wrote:
>>
>> Hi Karl, everyone!
>>
>> I'm writing to the moderator of the MCF mailing list.
>>
>> I'd like you to help my colleague subscribe to the MCF user mailing list.
>> He's tried to subscribe several times by sending the request to
>> user-subscr...@manifoldcf.apache.org but he said that it seemed that
>> they were just ignored and he couldn't get any responses from the
>> system.
>> The email address is abeleshev at gmail dot com.
>>
>> He has some questions and wants to contribute something if possible.
>>
>> Thanks!
>>
>> Koji


Re: Help for subscribing the user mailing list of MCF

2023-01-10 Thread Karl Wright
Hmm - I haven't heard of difficulties like this before.  The mail manager
is used apache-wide; if it doesn't work the best thing to do would be to
create an infra ticket in JIRA.

Karl


On Tue, Jan 10, 2023 at 3:50 AM Koji Sekiguchi 
wrote:

> Hi Karl, everyone!
>
> I'm writing to the moderator of the MCF mailing list.
>
> I'd like you to help my colleague subscribe to the MCF user mailing list.
> He's tried to subscribe several times by sending the request to
> user-subscr...@manifoldcf.apache.org but he said that it seemed that
> they were just ignored and he couldn't get any responses from the
> system.
> The email address is abeleshev at gmail dot com.
>
> He has some questions and wants to contribute something if possible.
>
> Thanks!
>
> Koji
>


Re: Is Manifold capable of handling these kind of files

2022-12-23 Thread Karl Wright
The internals of ManifoldCF will handle this fine if you are sure to set
the encoding of your database to be UTF-8.  However, I don't know about the
JCIFS library, and whether there might be a restriction on characters in
that code base.  I think you'd have to just try it and see, frankly.
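
If you are on PostgreSQL, that just means creating the database with UTF-8
encoding up front if you create it yourself, e.g. (a sketch only; the
database and owner names here are placeholders):

  CREATE DATABASE manifoldcf OWNER manifoldcf ENCODING 'UTF8' TEMPLATE template0;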

Karl


On Fri, Dec 23, 2022 at 6:52 AM Priya Arora  wrote:

> Hi
>
> Is Manifold capable of handling (ingesting) this kind of file in the Windows
> shares connector, where the path has special characters like these:
>
> demo/11208500/11208550/I. Proposal/PHASE II/220808
> Input/__MACOSX/虎尾/._62A33A6377CF08B472CC2AB562BD8B5D.JPG
>
>
> Any reply would be appreciated
>


Re: Manifoldcf -XML parsing error: Character reference "" is an invalid XML character.

2022-12-22 Thread ritika jain
Can anybody provide any clue on this. Would be of great help

On Thu, Dec 22, 2022 at 5:33 PM ritika jain 
wrote:

> Hi all,
>
> I am using Manifoldcf 2.21 version with Windows shares connector and
> Output as Elastic.
> I am facing this error while clicking "List all jobs" in ManifoldCF. Jobs
> are run/created in such a way that our API creates a ManifoldCF job
> object and thus creates/starts a job in ManifoldCF (on the server) from an API
> hit; these are processed/automated via cron jobs. Every time a job is
> created from the API (in Symphony) in ManifoldCF, the job does not start and
> looks like screenshot (2) below, and when clicking on "List all jobs" to
> at least view the job structure, it straight away gives this error.
> 1)
> [image: image.png]
> 2)
>
>
> [image: image.png]
>
> I have tried all exclusions - excluding xml files, excluding any files
> which can have special characters like (–), and also excluding the tilde sign
> (~), because when I searched, it looked like this was the issue, but it
> still persists.
>
> Can anybody help explain why ManifoldCF gives this error when a certain job
> (from cron) is created, and how to handle it?
>
> Thanks
> Ritika
>
>
>


Re: Unsubscribe

2022-10-22 Thread Muhammed Olgun
Hi Ronny,

Unsubscribing is self-service. Please follow the instructions here:

https://manifoldcf.apache.org/en_US/mail.html


On 22 Oct 2022 Sat at 08:55 Ronny Heylen  wrote:

> Hi,
> Please unsubscribe me from these emails, I don't work anymore.
>
> Regards,
>
> Ronny
>


Re: Frequent error while window shares job

2022-08-22 Thread Karl Wright
You will need to contact the current maintainers of the Jcifs library to
get answers to these questions.

Karl


On Mon, Aug 22, 2022 at 3:27 AM ritika jain 
wrote:

> Hi All,
>
> I have a Windows shares job to crawl files from a samba server; it's a huge
> job crawling documents in the millions (about 10). While running the job, we
> encounter two types of errors very frequently.
>
> 1)  WARN 2022-08-19T17:17:05,175 (Worker thread '7') - JCIFS: Possibly
> transient exception detected on attempt 3 while getting share security:
> Disconnecting during tree connect
> jcifs.smb.SmbException: Disconnecting during tree connect -- in what case
> can this occur?
>
> 2) WARN 2019-08-25T15:02:27,416 (Worker thread '11') - Service
> interruption reported for job 1565115290083 connection 'fs_vwoaahvp319':
> Timeout or other service interruption: The process cannot access the file
> because it is being used by another process.
>
> What can be the reason for this? Can anybody please help with how we can make
> the job skip the error (for any particular file) and then let the job
> run without aborting?
>
> Thanks
> Ritika
>


Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-14 Thread Karl Wright
Remember, there is already a "forget" button on the output connection,
which will remove everything associated with the connection.  It's meant to
be used when the output index has been reset and is empty.  I'm not sure
what you'd do differently, functionally.

Karl


On Tue, Jun 14, 2022 at 2:04 AM Koji Sekiguchi 
wrote:

> +1.
>
> I respect the design concept of ManifoldCF, but I think force-delete
> options would make MCF more
> useful for those who use MCF as a crawler. Adding force-delete options
> doesn't change default
> behaviors and it doesn't break backward compatibility.
>
> Koji
>
> On 2022/06/14 14:46, Ricardo Ruiz wrote:
> > Hi Karl
> > We are using  ManifoldCF as a crawler more than a synchronizer. We are
> thinking of contributing to
> > ManifoldCf by including a force job delete and force output connector
> delete, considering of course
> > the things that need to be deleted with them (BD, etc). Do you think
> this is possible?
> > We think that not only us but the community would be benefited from this
> kind of functionality.
> >
> > Ricardo.
> >
> > On Mon, Jun 13, 2022 at 7:34 PM Karl Wright  wrote:
> >
> > Because ManifoldCF is not just a crawler, but a synchronizer, a job
> represents and includes a
> > list of documents that have been indexed.  Deleting the job requires
> deleting the documents that
> > have been indexed also.  It's part of the basic model.
> >
> > So if you tear down your target output instance and then try to tear
> down the job, it won't
> > work.  ManifoldCF won't just throw away the memory of those
> documents and act as if nothing
> > happened.
> >
> > If you're just using ManifoldCF as a crawler, therefore, your fix is
> about as good as it gets.
> >
> > You can get into similar trouble if, for example, you reinstall
> ManifoldCF but forget to include
> > a connector class that was there before.  Carnage ensues.
> >
> > Karl
> >
> >
> > On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz  wrote:
> >
> > Hi all
> > My team uses mcf to crawl documents and index into solr
> instances, but for reasons beyond
> > our control, sometimes the instances or collections are deleted.
> > When we try to delete a job and the solr instance or collection
> doesn't exist anymore, the
> > job reaches the "End notification" status and gets stuck there.
> No other job can be aborted
> > or deleted until the initial error is fixed.
> >
> > We are able to clean up the errors following the next steps:
> >
> > 1.  Reconfigure the output connector to an existing Solr
> instance and collection
> > 2.  Reset the output connection, so it forgets any indexed
> documents.
> > 3.  Reset the job, so it forgets any indexed documents.
> > 4.  Restart the ManifoldCF server.
> >
> > Is there any other way we can solve this error? Is there any way
> we can force delete the job
> > if we don't care about the job's documents anymore?
> >
> > Thanks in advance.
> > Ricardo.
> >
>


Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-14 Thread Koji Sekiguchi

+1.

I respect the design concept of ManifoldCF, but I think force-delete options would make MCF more
useful for those who use MCF as a crawler. Adding force-delete options doesn't change default
behaviors and it doesn't break backward compatibility.


Koji

On 2022/06/14 14:46, Ricardo Ruiz wrote:

Hi Karl
We are using  ManifoldCF as a crawler more than a synchronizer. We are thinking of contributing to 
ManifoldCf by including a force job delete and force output connector delete, considering of course 
the things that need to be deleted with them (DB, etc). Do you think this is possible?

We think that not only us but the community would benefit from this kind
of functionality.

Ricardo.

On Mon, Jun 13, 2022 at 7:34 PM Karl Wright  wrote:

Because ManifoldCF is not just a crawler, but a synchronizer, a job
represents and includes a
list of documents that have been indexed.  Deleting the job requires 
deleting the documents that
have been indexed also.  It's part of the basic model.

So if you tear down your target output instance and then try to tear down 
the job, it won't
work.  ManifoldCF won't just throw away the memory of those documents and 
act as if nothing
happened.

If you're just using ManifoldCF as a crawler, therefore, your fix is about 
as good as it gets.

You can get into similar trouble if, for example, you reinstall ManifoldCF 
but forget to include
a connector class that was there before.  Carnage ensues.

Karl


On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz  wrote:

Hi all
My team uses mcf to crawl documents and index into solr instances, but 
for reasons beyond
our control, sometimes the instances or collections are deleted.
When we try to delete a job and the solr instance or collection doesn't 
exist anymore, the
job reaches the "End notification" status and gets stuck there. No 
other job can be aborted
or deleted until the initial error is fixed.

We are able to clean up the errors following the next steps:

1.  Reconfigure the output connector to an existing Solr instance and 
collection
2.  Reset the output connection, so it forgets any indexed documents.
3.  Reset the job, so it forgets any indexed documents.
4.  Restart the ManifoldCF server.

Is there any other way we can solve this error? Is there any way we can 
force delete the job
if we don't care about the job's documents anymore?

Thanks in advance.
Ricardo.



Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-13 Thread Ricardo Ruiz
Hi Karl
We are using  ManifoldCF as a crawler more than a synchronizer. We are
thinking of contributing to ManifoldCf by including a force job delete and
force output connector delete, considering of course the things that need
to be deleted with them (DB, etc). Do you think this is possible?
We think that not only us but the community would benefit from this
kind of functionality.

Ricardo.

On Mon, Jun 13, 2022 at 7:34 PM Karl Wright  wrote:

> Because ManifoldCF is not just a crawler, but a synchronizer, a job
> represents and includes a list of documents that have been indexed.
> Deleting the job requires deleting the documents that have been indexed
> also.  It's part of the basic model.
>
> So if you tear down your target output instance and then try to tear down
> the job, it won't work.  ManifoldCF won't just throw away the memory of
> those documents and act as if nothing happened.
>
> If you're just using ManifoldCF as a crawler, therefore, your fix is about
> as good as it gets.
>
> You can get into similar trouble if, for example, you reinstall ManifoldCF
> but forget to include a connector class that was there before.  Carnage
> ensues.
>
> Karl
>
>
> On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz 
> wrote:
>
>> Hi all
>> My team uses mcf to crawl documents and index into solr instances, but
>> for reasons beyond our control, sometimes the instances or collections are
>> deleted.
>> When we try to delete a job and the solr instance or collection doesn't
>> exist anymore, the job reaches the "End notification" status and gets stuck
>> there. No other job can be aborted or deleted until the initial error is
>> fixed.
>>
>> We are able to clean up the errors following the next steps:
>>
>> 1.  Reconfigure the output connector to an existing Solr instance and
>> collection
>> 2.  Reset the output connection, so it forgets any indexed documents.
>> 3.  Reset the job, so it forgets any indexed documents.
>> 4.  Restart the ManifoldCF server.
>>
>> Is there any other way we can solve this error? Is there any way we can
>> force delete the job if we don't care about the job's documents anymore?
>>
>> Thanks in advance.
>> Ricardo.
>>
>


Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-13 Thread Karl Wright
Because ManifoldCF is not just a crawler, but a synchronizer, a job
represents and includes a list of documents that have been indexed.
Deleting the job requires deleting the documents that have been indexed
also.  It's part of the basic model.

So if you tear down your target output instance and then try to tear down
the job, it won't work.  ManifoldCF won't just throw away the memory of
those documents and act as if nothing happened.

If you're just using ManifoldCF as a crawler, therefore, your fix is about
as good as it gets.

You can get into similar trouble if, for example, you reinstall ManifoldCF
but forget to include a connector class that was there before.  Carnage
ensues.

Karl


On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz  wrote:

> Hi all
> My team uses mcf to crawl documents and index into solr instances, but for
> reasons beyond our control, sometimes the instances or collections are
> deleted.
> When we try to delete a job and the solr instance or collection doesn't
> exist anymore, the job reaches the "End notification" status and gets stuck
> there. No other job can be aborted or deleted until the initial error is
> fixed.
>
> We are able to clean up the errors following the next steps:
>
> 1.  Reconfigure the output connector to an existing Solr instance and
> collection
> 2.  Reset the output connection, so it forgets any indexed documents.
> 3.  Reset the job, so it forgets any indexed documents.
> 4.  Restart the ManifoldCF server.
>
> Is there any other way we can solve this error? Is there any way we can
> force delete the job if we don't care about the job's documents anymore?
>
> Thanks in advance.
> Ricardo.
>


Re: Job Service Interruption- and stops

2022-04-29 Thread Karl Wright
"Repeated service interruption" means that it happens again and again.

For this particular document, the problem is that the error we are seeing
is: "The process cannot access the file because it is being used by another
process."

ManifoldCF assumes that if it retries enough it should be able to read the
document eventually.  In this case, if it cannot read the document after 6+
hours, it assumes something is wrong and stops the job.  We can make it
continue at this point but the issue is that you shouldn't be seeing such
an error for such a long period of time.  Perhaps you might want to
research why this is taking place.

Karl


On Fri, Apr 29, 2022 at 4:54 AM ritika jain 
wrote:

> Hi All,
>
> With the window shares connector, on the server I am getting this
> exception and due to repeated service interruption *job stops.*
>
> Error: Repeated service interruptions - failure processing document: The
> process cannot access the file because it is being used by another process.
>
> How we can prevent this.
> I read in the code that it retries the document. But due to service
> interruptions, the jobs stopped.
>
>
> Thanks
> Ritika
>


Re: Log4j Update Doubt

2022-03-15 Thread Karl Wright
We cannot do back patches of older versions of ManifoldCF.  There is a new
release which shipped in January that addresses log4j issues.  I suggest
updating to that.

Karl


On Tue, Mar 15, 2022 at 8:59 AM ritika jain 
wrote:

> Hi,
>
> How does ManifoldCF use the log4j files in the bin directory/distribution? Is
> the location "D:\\Manifoldcf\apache-manifoldcf-2.14\lib" - that is, the lib
> folder - the only place the physical files are present?
>
> Also, if the log4j dependency issue has been resolved and the version updated
> to 2.15 or higher, will it be reflected in the ManifoldCF 2.14 version
> as well? If not, can you help let me know in what places the log4j 2.4.1
> jar files should be replaced with 2.15?
>
> When the log4j 2.15 jar is manually placed in the 'lib' folder and the
> older version (2.4.1) is deleted, I got this error:
> [image: image.png]
>
> What other location is needed to have the latest jar file.
>
> Thanks
> Ritika
>


Re: Manifoldcf freezes and sit idle

2022-01-31 Thread Karl Wright
As I've mentioned before, the best way to diagnose problems like this is to
get a thread dump of the agents process.  There are many potential reasons
it could occur, ranging from stuck locks to resource starvation.  What
locking model are you using?
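
A minimal way to capture one, assuming a standard JDK inside the container
(jps and jstack are stock JDK tools; adjust for however your agents process
is launched):

  jps -l                                   # find the pid of the agents process
  jstack -l <pid> > agents-threaddump.txt  # full thread dump with lock info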

Karl


On Mon, Jan 31, 2022 at 6:02 AM ritika jain 
wrote:

> Hi,
>
> I am using Manifoldcf 2.14, web connector and Elastic as output.
> I have observed that after a certain period of continuous running, the job freezes
> and does not do/process anything. The Simple History shows nothing after a
> certain point, and it's not just one job; it has been observed for 3
> different jobs. I also checked whether a certain document is to blame (that seems NOT to be
> the case).
>
> Only restarting the docker container helps. After restarting, the job
> continues.
>
> What can be the possible reason for this? How can it be prevented for PROD?
>
> Thanks
> Ritika
>


Re: Log4j dependency

2021-12-14 Thread Karl Wright
ManifoldCF framework and connectors use log4j 2.x to dump information to
the ManifoldCF log file.

Please read the following page:

https://logging.apache.org/log4j/2.x/security.html

Specifically, this part:

'Description: Apache Log4j2 <=2.14.1 JNDI features used in configuration,
log messages, and parameters do not protect against attacker controlled
LDAP and other JNDI related endpoints. An attacker who can control log
messages or log message parameters can execute arbitrary code loaded from
LDAP servers when message lookup substitution is enabled. From log4j
2.15.0, this behavior has been disabled by default.'

In other words, unless you are allowing external people access to the
crawler UI or to the API, it's impossible to exploit this in ManifoldCF.

However, in the interest of assuring people, we are updating this core
dependency to the recommended 2.15.0 anyway.  The release is scheduled by
the end of December.

Karl


On Tue, Dec 14, 2021 at 4:41 AM ritika jain 
wrote:

> Hi All,
>
> How does ManifoldCF use log4j? When I checked the pom.xml of the ES connector,
> it is shown as an *exclusion* of the Maven dependency.
> [image: image.png]
>
> But when I checked the project's downloaded dependencies, it shows it being
> used and downloaded.
>
> [image: image.png]
> How does ManifoldCF use log4j, and how can we change its version?
>
> Thanks
> Ritika
>


Re: Log4j dependency

2021-12-14 Thread Furkan KAMACI
Hi Ritika,

For maven check here:

https://github.com/apache/manifoldcf/blob/trunk/pom.xml#L80

For Ant check here:

https://github.com/apache/manifoldcf/blob/trunk/build.xml#L87

Kind Regards,
Furkan KAMACI

On Tue, Dec 14, 2021 at 12:41 PM ritika jain 
wrote:

> Hi All,
>
> How does ManifoldCF use log4j? When I checked the pom.xml of the ES connector,
> it is shown as an *exclusion* of the Maven dependency.
> [image: image.png]
>
> But when I checked the project's downloaded dependencies, it shows it being
> used and downloaded.
>
> [image: image.png]
> How does ManifoldCF use log4j, and how can we change its version?
>
> Thanks
> Ritika
>


Re: Manifoldcf background process

2021-11-18 Thread Karl Wright
The degree of parallelism can be controlled in two ways.
The first way is to set the number of worker threads to something
reasonable.  Usually, this is no more than about 2x the number of
processors you have.
The second way is to control the number of connections in your jcifs
connector to keep it at something reasonable, e.g. 4 (because windows SMB
is really not good at handling more than that anyway).

These two controls are independent of each other.  From your description,
it sounds like the parameter you want to set is not the number of worker
threads but rather the number of connections.  But setting both properly
certainly will help.  The reason that having a high worker thread count is
bad is because you use up some amount of memory for each active thread, and
that means if you give too big a value you need to give ManifoldCF way too
much memory, and you won't be able to compute it in advance either.
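
For reference, the worker thread count lives in properties.xml (a sketch;
the value here is illustrative - the default is 30 if unset, as I recall):

  <property name="org.apache.manifoldcf.crawler.threads" value="16"/>

The connection count, by contrast, is set per repository connection in the
UI, on the connection's throttling tab (its maximum-connections setting).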


Karl


On Thu, Nov 18, 2021 at 2:49 AM ritika jain 
wrote:

> Hi All,
>
> I would like to understand the background processing of ManifoldCF Windows
> shares jobs, and how it processes the paths mentioned in the job
> configuration.
>
> I am creating a dynamic job via the API using PHP which will pick up approximately
> 70k documents: a dynamic job with 70k different paths mentioned
> in the job, with folder/subfolder paths and the file name in the
> filespec.
>
> My question is, how does ManifoldCF work in the background to access all the
> different folders at a time? Because most files correspond to
> different folders. Is ManifoldCF loaded down while fetching all folder
> permissions and accessing folder/subfolder files? How does it fetch
> permissions for one folder, say for path 1, and simultaneously fetch a different
> folder's permissions/access for, say, path 2?
> Does it overload ManifoldCF? Because when this job is running,
> ManifoldCF seems to be under heavy load; it gets really, really slow and
> we have to restart the docker container every 15-20 min.
>
> How can a job be run efficiently?
>
> Thanks
> Ritika
>
>


Re: Manifold Job process issue

2021-11-15 Thread Karl Wright
SMB exceptions with jcifs in the trace tell us that JCIFS couldn't talk to
your windows share server.  That's all we can tell though.
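
That said, if these are genuinely slow responses rather than a dead server,
jcifs-ng's client timeouts can be raised via jcifs.* properties.  A sketch,
with the explicit assumption that the connector passes JVM system properties
through to its jcifs-ng context (something you would need to verify in the
code):

  java -Djcifs.smb.client.responseTimeout=60000 \
       -Djcifs.smb.client.soTimeout=65000 \
       ... (the usual agents startup)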

Karl


On Mon, Nov 15, 2021 at 7:24 AM ritika jain 
wrote:

> Hi,
>
> Raising the concern above again: to process only 60k documents (when the
> clock issue is fixed too), the job is not progressing; it's been stuck
> for days. So we had to restart the docker container every time for it to
> process.
> This time we are getting a Timeout Exception. What can be the
> reason for it, and how can it be fixed?
>   ... 24 more
> [Worker thread '23'] WARN jcifs.util.transport.Transport - sendrecv failed
> jcifs.util.transport.RequestTimeoutException: Transport40 timedout waiting
> for response to
> command=SMB2_TREE_CONNECT,status=0,flags=0x,mid=4,wordCount=0,byteCount=86
> at
> jcifs.util.transport.Transport.waitForResponses(Transport.java:365)
> at jcifs.util.transport.Transport.sendrecv(Transport.java:232)
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1021)
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1539)
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:611)
> at
> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:614)
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:568)
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:489)
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:465)
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:426)
> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
> at jcifs.smb.SmbFile.length(SmbFile.java:1541)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileLength(SharedDriveConnector.java:2340)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector$ProcessDocumentsFilter.accept(SharedDriveConnector.java:4935)
> at
> jcifs.smb.SmbEnumerationUtil$ResourceFilterWrapper.accept(SmbEnumerationUtil.java:331)
> at
> jcifs.smb.FileEntryAdapterIterator.advance(FileEntryAdapterIterator.java:82)
> at
> jcifs.smb.FileEntryAdapterIterator.<init>(FileEntryAdapterIterator.java:52)
> at
> jcifs.smb.DirFileEntryAdapterIterator.<init>(DirFileEntryAdapterIterator.java:37)
> at jcifs.smb.SmbEnumerationUtil.doEnum(SmbEnumerationUtil.java:223)
> at
> jcifs.smb.SmbEnumerationUtil.listFiles(SmbEnumerationUtil.java:279)
> at jcifs.smb.SmbFile.listFiles(SmbFile.java:1273)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileListFiles(SharedDriveConnector.java:2380)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:818)
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [Worker thread '23'] WARN jcifs.smb.SmbTransportImpl - Disconnecting
> transport while still in use Transport40[backup002.directory.intra/
> 136.231.158.172:445,state=5,signingEnforced=false,usage=5]:
> [SmbSession[credentials=svc_EScrawl,targetHost=backup002.directory.intra,targetDomain=null,uid=0,connectionState=2,usage=3]]
> [Worker thread '23'] WARN jcifs.smb.SmbSessionImpl - Logging off session
> while still in use
> SmbSession[credentials=svc_EScrawl,targetHost=backup002.directory.intra,targetDomain=null,uid=0,connectionState=3,usage=3]:[SmbTree[share=WINPROJECTS,service=?,tid=-1,inDfs=false,inDomainDfs=false,connectionState=1,usage=1]]
> [Worker thread '10'] WARN jcifs.util.transport.Transport - sendrecv failed
> jcifs.util.transport.RequestTimeoutException: Transport41 timedout waiting
> for response to
> command=SMB2_TREE_CONNECT,status=0,flags=0x,mid=4,wordCount=0,byteCount=80
> at
> jcifs.util.transport.Transport.waitForResponses(Transport.java:365)
> at jcifs.util.transport.Transport.sendrecv(Transport.java:232)
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1021)
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1539)
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:611)
> at
> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:614)
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:568)
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:489)
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:465)
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:426)
> at 

Re: Manifold Job process issue

2021-11-15 Thread ritika jain
Hi,

Raising the concern above again: to process only 60k documents (when the
clock issue is fixed too), the job is not progressing; it's been stuck
for days. So we had to restart the docker container every time for it to
process.
This time we are getting a Timeout Exception. What can be the
reason for it, and how can it be fixed?
  ... 24 more
[Worker thread '23'] WARN jcifs.util.transport.Transport - sendrecv failed
jcifs.util.transport.RequestTimeoutException: Transport40 timedout waiting
for response to
command=SMB2_TREE_CONNECT,status=0,flags=0x,mid=4,wordCount=0,byteCount=86
at
jcifs.util.transport.Transport.waitForResponses(Transport.java:365)
at jcifs.util.transport.Transport.sendrecv(Transport.java:232)
at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1021)
at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1539)
at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:611)
at
jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:614)
at
jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:568)
at
jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:489)
at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:465)
at
jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:426)
at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
at jcifs.smb.SmbFile.length(SmbFile.java:1541)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileLength(SharedDriveConnector.java:2340)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector$ProcessDocumentsFilter.accept(SharedDriveConnector.java:4935)
at
jcifs.smb.SmbEnumerationUtil$ResourceFilterWrapper.accept(SmbEnumerationUtil.java:331)
at
jcifs.smb.FileEntryAdapterIterator.advance(FileEntryAdapterIterator.java:82)
at
jcifs.smb.FileEntryAdapterIterator.<init>(FileEntryAdapterIterator.java:52)
at
jcifs.smb.DirFileEntryAdapterIterator.<init>(DirFileEntryAdapterIterator.java:37)
at jcifs.smb.SmbEnumerationUtil.doEnum(SmbEnumerationUtil.java:223)
at
jcifs.smb.SmbEnumerationUtil.listFiles(SmbEnumerationUtil.java:279)
at jcifs.smb.SmbFile.listFiles(SmbFile.java:1273)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileListFiles(SharedDriveConnector.java:2380)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:818)
at
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
[Worker thread '23'] WARN jcifs.smb.SmbTransportImpl - Disconnecting
transport while still in use Transport40[backup002.directory.intra/
136.231.158.172:445,state=5,signingEnforced=false,usage=5]:
[SmbSession[credentials=svc_EScrawl,targetHost=backup002.directory.intra,targetDomain=null,uid=0,connectionState=2,usage=3]]
[Worker thread '23'] WARN jcifs.smb.SmbSessionImpl - Logging off session
while still in use
SmbSession[credentials=svc_EScrawl,targetHost=backup002.directory.intra,targetDomain=null,uid=0,connectionState=3,usage=3]:[SmbTree[share=WINPROJECTS,service=?,tid=-1,inDfs=false,inDomainDfs=false,connectionState=1,usage=1]]
[Worker thread '10'] WARN jcifs.util.transport.Transport - sendrecv failed
jcifs.util.transport.RequestTimeoutException: Transport41 timedout waiting
for response to
command=SMB2_TREE_CONNECT,status=0,flags=0x,mid=4,wordCount=0,byteCount=80
at
jcifs.util.transport.Transport.waitForResponses(Transport.java:365)
at jcifs.util.transport.Transport.sendrecv(Transport.java:232)
at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1021)
at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1539)
at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:611)
at
jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:614)
at
jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:568)
at
jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:489)
at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:465)
at
jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:426)
at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
at jcifs.smb.SmbFile.exists(SmbFile.java:845)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2220)
at
org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
at

Re: Manifold Job process issue

2021-11-09 Thread Karl Wright
One hour is quite a lot and will wreak havoc on the document queue.
Karl


On Tue, Nov 9, 2021 at 7:08 AM ritika jain  wrote:

> I have checked, there is only one hour time difference between docker
> container and docker host
>
> On Tue, Nov 9, 2021 at 4:41 PM Karl Wright  wrote:
>
>> If your docker image's clock is out of sync badly with the real world,
>> then System.currentTimeMillis() may give bogus values, and ManifoldCF uses
>> that to manage throttling etc.  I don't know if that is the correct
>> explanation but it's the only thing I can think of.
>>
>> Karl
>>
>>
>> On Tue, Nov 9, 2021 at 4:56 AM ritika jain 
>> wrote:
>>
>>>
>>> Hi All,
>>>
>>> I am using the window shares connector, ManifoldCF 2.14, and ES as output. I
>>> have configured a job to process 60k documents. Also, these documents are
>>> new and do not have corresponding values in the DB and ES index.
>>>
>>> So ideally it should process/index the documents as soon as the job
>>> starts.
>>> But ManifoldCF does not process anything for many hours after job
>>> startup. I have tried restarting the docker container as well. But it didn't help
>>> much. Also, the logs only correspond to long-running queries.
>>>
>>> Why does ManifoldCF behave like that?
>>>
>>> Thanks
>>> Ritika
>>>
>>


Re: Manifold Job process issue

2021-11-09 Thread ritika jain
I have checked, there is only one hour time difference between docker
container and docker host

On Tue, Nov 9, 2021 at 4:41 PM Karl Wright  wrote:

> If your docker image's clock is out of sync badly with the real world,
> then System.currentTimeMillis() may give bogus values, and ManifoldCF uses
> that to manage throttling etc.  I don't know if that is the correct
> explanation but it's the only thing I can think of.
>
> Karl
>
>
> On Tue, Nov 9, 2021 at 4:56 AM ritika jain 
> wrote:
>
>>
>> Hi All,
>>
>> I am using the window shares connector, ManifoldCF 2.14, and ES as output. I
>> have configured a job to process 60k documents. Also, these documents are
>> new and do not have corresponding values in the DB and ES index.
>>
>> So ideally it should process/index the documents as soon as the job
>> starts.
>> But ManifoldCF does not process anything for many hours after job startup. I
>> have tried restarting the docker container as well. But it didn't help
>> much. Also, the logs only correspond to long-running queries.
>>
>> Why does ManifoldCF behave like that?
>>
>> Thanks
>> Ritika
>>
>


Re: Manifold Job process issue

2021-11-09 Thread Karl Wright
If your docker image's clock is out of sync badly with the real world, then
System.currentTimeMillis() may give bogus values, and ManifoldCF uses that
to manage throttling etc.  I don't know if that is the correct explanation
but it's the only thing I can think of.
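
A trivial sketch for checking the clock the agents JVM actually sees (run it
inside the same image and compare the output against the host's clock):

  // ClockCheck.java - print the JVM clock, i.e. the value ManifoldCF's
  // throttling logic is based on
  public class ClockCheck {
    public static void main(String[] args) {
      long now = System.currentTimeMillis();
      System.out.println(now + " = " + java.time.Instant.ofEpochMilli(now));
    }
  }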

Karl


On Tue, Nov 9, 2021 at 4:56 AM ritika jain  wrote:

>
> Hi All,
>
> I am using the window shares connector, ManifoldCF 2.14, and ES as output. I
> have configured a job to process 60k documents. Also, these documents are
> new and do not have corresponding values in the DB and ES index.
>
> So ideally it should process/index the documents as soon as the job starts.
> But ManifoldCF does not process anything for many hours after job startup. I
> have tried restarting the docker container as well. But it didn't help
> much. Also, the logs only correspond to long-running queries.
>
> Why does ManifoldCF behave like that?
>
> Thanks
> Ritika
>


Re: Duplicate key error

2021-10-27 Thread Karl Wright
We see errors like this only because MCF is a highly multithreaded
application, and two threads sometimes are able to collide in what they are
doing even though they are transactionally separated.  That is because of
bugs in the database software.  So if you restart the job it should not
encounter the same problem.

If the problem IS repeatable, we will of course look deeper into what is
going on.

Karl


On Wed, Oct 27, 2021 at 9:52 AM Karl Wright  wrote:

> Is it repeatable?  My guess is it is not repeatable.
> Karl
>
> On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
> wrote:
>
>> So, can it be left as it is? Because it is preventing the job from completing,
>> and it's stopping.
>>
>> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>>
>>> That's a database bug.  All of our underlying databases have some bugs
>>> of this kind.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>>> wrote:
>>>
 Hi All,

 While using ManifoldCF 2.14 with the Web connector and ES connector, after
 the job has been running for a while (the jobs ingest documents in the lakhs),
 we got this error on PROD.

 Can anybody suggest what could be the problem?

 PRODUCTION MANIFOLD ERROR:

 Error: ERROR: duplicate key value violates unique constraint
 "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.


 Thanks





Re: Duplicate key error

2021-10-27 Thread Karl Wright
Is it repeatable?  My guess is it is not repeatable.
Karl

On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
wrote:

> So, can it be left as it is? Because it is preventing the job from completing,
> and it's stopping.
>
> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>
>> That's a database bug.  All of our underlying databases have some bugs of
>> this kind.
>>
>> Karl
>>
>>
>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> While using ManifoldCF 2.14 with the Web connector and ES connector, after
>>> the job has been running for a while (the jobs ingest documents in the lakhs),
>>> we got this error on PROD.
>>>
>>> Can anybody suggest what could be the problem?
>>>
>>> PRODUCTION MANIFOLD ERROR:
>>>
>>> Error: ERROR: duplicate key value violates unique constraint
>>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>>
>>>
>>> Thanks
>>>
>>>
>>>


Re: Duplicate key error

2021-10-27 Thread ritika jain
So, can it be left as it is? Because it is preventing the job from completing,
and it's stopping.

On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:

> That's a database bug.  All of our underlying databases have some bugs of
> this kind.
>
> Karl
>
>
> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
> wrote:
>
>> Hi All,
>>
>> While using ManifoldCF 2.14 with the Web connector and ES connector, after
>> the job has been running for a while (the jobs ingest documents in the lakhs),
>> we got this error on PROD.
>>
>> Can anybody suggest what could be the problem?
>>
>> PRODUCTION MANIFOLD ERROR:
>>
>> Error: ERROR: duplicate key value violates unique constraint
>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>
>>
>> Thanks
>>
>>
>>


Re:

2021-10-26 Thread Karl Wright
That's a database bug.  All of our underlying databases have some bugs of
this kind.

Karl


On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
wrote:

> Hi All,
>
> While using ManifoldCF 2.14 with the Web connector and ES connector, after
> the job has been running for a while (the jobs ingest documents in the lakhs),
> we got this error on PROD.
>
> Can anybody suggest what could be the problem?
>
> PRODUCTION MANIFOLD ERROR:
>
> Error: ERROR: duplicate key value violates unique constraint
> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>
>
> Thanks
>
>
>


Re: Windows Shares job-Limit on defining no of paths

2021-10-25 Thread Karl Wright
The only limit is that the more you add, the slower it gets.

Karl


On Mon, Oct 25, 2021 at 6:06 AM ritika jain 
wrote:

> Hi,
> Is there any limit on the number of paths we can define in a job using
> Windows Shares as the repository and ES as the output?
>
> Thanks
>


Re: Null Pointer Exception

2021-10-25 Thread Karl Wright
The API should really catch this situation.  Basically, you are calling a
function that requires an input but you are not providing one.  In that
case the API sets the input to "null", and the detailed operation is
called.  The detailed operation is not expecting a null input.

This is the API piece that is not flagging the error properly:

// Parse the input
Configuration input;

if (protocol.equals("json"))
{
  if (argument.length() != 0)
  {
    input = new Configuration();
    input.fromJSON(argument);
  }
  else
    input = null;
}
else
{
  response.sendError(response.SC_BAD_REQUEST,"Unknown API protocol: "+protocol);
  return;
}

Since this is POST, it should assume that the input cannot be null, and if
it is, it's a bad request.
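
A minimal sketch of that guard (illustrative only, not the actual committed
fix; the message text is made up):

  // For POST verbs, a missing input document is a client error, not a null input.
  if (input == null)
  {
    response.sendError(response.SC_BAD_REQUEST,"POST operations require a non-empty input document");
    return;
  }

On the client side, the immediate workaround is to make sure the PHP call
always sends a non-empty JSON body with the request.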

Karl


On Mon, Oct 25, 2021 at 2:44 AM ritika jain 
wrote:

> Hi,
>
> I am getting a NullPointerException while creating a job programmatically
> via PHP.
> Can anybody suggest the reason for this?
>
> Error 500: Server Error
> HTTP ERROR 500 - Problem accessing
> /mcf-api-service/json/jobs. Reason: Server Error. Caused
> by: java.lang.NullPointerException at
> org.apache.manifoldcf.agents.system.ManifoldCF.findConfigurationNode(ManifoldCF.java:208)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.apiPostJob(ManifoldCF.java:3539)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.executePostCommand(ManifoldCF.java:3585)
> at
> org.apache.manifoldcf.apiservlet.APIServlet.executePost(APIServlet.java:576)
> at org.apache.manifoldcf.apiservlet.APIServlet.doPost(APIServlet.java:175)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:497) at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
> at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
> at java.lang.Thread.run(Thread.java:748)
>
>


Re: Error: Repeated service interruptions - failure processing document: Read timed out

2021-09-30 Thread Karl Wright
Hi,

You say this is a "Tika error".  Is this Tika as a stand-alone service?  I
do not recognize any ManifoldCF classes whatsoever in this thread dump.

If this is Tika, I suggest contacting the Tika team.

Karl


On Thu, Sep 30, 2021 at 3:02 AM Bisonti Mario 
wrote:

> Additional info.
>
>
>
> I am using 2.17-dev version
>
>
>
>
>
>
>
> *From:* Bisonti Mario
> *Sent:* Tuesday, 28 September 2021 17:01
> *To:* user@manifoldcf.apache.org
> *Subject:* Error: Repeated service interruptions - failure processing
> document: Read timed out
>
>
>
> Hello
>
>
>
> I have an error on a job that parses a network folder.
>
>
>
> This is the tika error:
> 2021-09-28 16:14:50 INFO  Server:415 - Started @1367ms
>
> 2021-09-28 16:14:50 WARN  ContextHandler:1671 - Empty contextPath
>
> 2021-09-28 16:14:50 INFO  ContextHandler:916 - Started
> o.e.j.s.h.ContextHandler@3dd69f5a{/,null,AVAILABLE}
>
> 2021-09-28 16:14:50 INFO  TikaServerCli:413 - Started Apache Tika server
> at http://sengvivv02.vimar.net:9998/
>
> 2021-09-28 16:15:04 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:23 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:24 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:26 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:26 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:30:28 WARN  PhaseInterceptorChain:468 - Interceptor for {
> http://resource.server.tika.apache.org/}MetadataResource has thrown
> exception, unwinding now
>
> org.apache.cxf.interceptor.Fault: Could not send Message.
>
> at
> org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:67)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at org.eclipse.jetty.server.Server.handle(Server.java:516)
>
> at
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
>
> at
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
>
> at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
>
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>
> at
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>
> at
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
>
> at java.base/java.lang.Thread.run(Thread.java:834)
>
> Caused by: org.eclipse.jetty.io.EofException
>
> at
> org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
>
> at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
>
> at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
>
> at
> org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
>
> at
> org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:826)
>
> at
> 

Re: Tika Parser Issue

2021-09-07 Thread Karl Wright
This is something you should contact the Tika project about.
Karl


On Tue, Sep 7, 2021 at 8:46 AM ritika jain  wrote:

> Hi All,
>
> I am using tika-core 1.21 and tika-parsers 1.21 jar files as tika
> dependencies in Manifoldcf 2.14 version.
> I am getting some issues while parsing *PDF* files: some strange characters
> appear. I also tried changing the Tika jar version to 1.24 and 1.27 (those
> didn't even extract the files correctly).
>
> [image: 365.jfif]
> I also checked the document content; it seems to be fine.
> Can anybody help me with this?
>
> Thanks
> Ritika
>


Re: ZooKeeper leaking or does not handle temporary network failures

2021-08-26 Thread Raman Gupta
I'm having issues with ManifoldCF losing connection to ZooKeeper. This is
easily repeatable: I just need to leave ManifoldCF running for a few days.
The results are not always "No route to host" as I previously reported --
sometimes its just connect timeouts or other behavior, but the connection
is always lost after a few days. I have several other projects running in
the same environment connecting to ZooKeeper and none of them experience
any issues at all, so this is something specific to MCF and not the
environment or ZK.

Is there any way I can at least debug this so I can provide better
information on which to file an issue?

Thanks,
Raman


On Mon, Apr 12, 2021 at 3:33 PM Raman Gupta  wrote:

> After my agent process has been up for a while, it starts producing these
> logs repeatedly:
>
> java.net.NoRouteToHostException: No route to host
> at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelIm
> at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocke
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
>
> The ZooKeeper cluster itself is up and responding normally, and the
> network is available, and the zookeeper host resolves, though a transient
> lookup failure cannot be ruled out, as ManifoldCF and Zookeeper are running
> inside Kubernetes. A restart of the agent promptly fixes the issue.
>
> Regards,
> Raman
>
>


Re: Query:JCIFS connector

2021-08-23 Thread Karl Wright
I have a work day today, with limited time.
The UI is what it is; it does not have capabilities beyond what is stated
in the UI and in the manual.  It's meant to allow construction of paths
piece by piece, not by full subdirectory at a time.

You can obviously use the API if you want to construct path specifications
some other way.  It sounds like you are doing things programmatically
anyway so I would definitely look into that.
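
For instance, the jobs API (POST to /mcf-api-service/json/jobs, the same
endpoint discussed elsewhere on this list) takes the whole document
specification as JSON, so a full directory path can be written in one go.
A rough sketch of the relevant fragment, modeled on job JSON posted on this
list - the path value here is a placeholder:

  {"_type_":"startpoint","_value_":"",
   "_attribute_path":"windows\/Job\/SomeFolder\/SomeSubfolder",
   "include":[{"_attribute_type":"file","_attribute_indexable":"yes",
               "_attribute_filespec":"\/*.pdf","_value_":""}]}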

Karl


On Mon, Aug 23, 2021 at 3:52 AM ritika jain 
wrote:

> Can anybody have a clue on this ?
>
> On Fri, Aug 20, 2021 at 12:33 PM ritika jain 
> wrote:
>
>> Hi All,
>>
> I have a query: is there any way we can mention a
> subdirectory path in the file spec of the Window Shares connector?
>
> My requirement is to mention the topmost hierarchical folder at the top, as
> shown in the screenshot,
> and in the file spec, the requirement is to mention the file name together with its
> subdirectories.
>> *Say for example there is a file *
>> E:\sharing\demo\Index.pdf
>>
> The requirement is to mention "sharing" at the top and the rest of the path in the file spec.
>>
>> [image: image.png]
>>
>> Is there any way to do it? Any help would be appreciated
>>
>> Thanks
>> Ritika
>>
>>


Re: Query:JCIFS connector

2021-08-23 Thread ritika jain
Can anybody have a clue on this ?

On Fri, Aug 20, 2021 at 12:33 PM ritika jain 
wrote:

> Hi All,
>
> I have a query: is there any way we can mention a
> subdirectory path in the file spec of the Window Shares connector?
>
> My requirement is to mention the topmost hierarchical folder at the top, as
> shown in the screenshot,
> and in the file spec, the requirement is to mention the file name together with its
> subdirectories.
> *Say for example there is a file *
> E:\sharing\demo\Index.pdf
>
> The requirement is to mention "sharing" at the top and the rest of the path in the file spec.
>
> [image: image.png]
>
> Is there any way to do it? Any help would be appreciated
>
> Thanks
> Ritika
>
>


Re: Job Deletion query

2021-08-12 Thread Karl Wright
Yes, when you delete a job, the indexed documents associated with that job
are removed from the index.

ManifoldCF is a synchronizer, not a crawler, so when you remove the
synchronization job then if it didn't delete the indexed documents they
would be left dangling.

Karl


On Thu, Aug 12, 2021 at 3:46 AM ritika jain 
wrote:

> Hi All,
>
> When we delete a job in ManifoldCF, does it also delete the documents
> indexed via that job from the Elastic index as well?
>
> I understand that when a job is deleted from the ManifoldCF interface, it will
> delete all the documents referenced via that job from Postgres. But why is
> it also deleted from the ES index?
>
> Thanks
> Ritika
>


Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
Seems to be working now!!! Thanks a lot !!!

On Wed, Aug 11, 2021 at 6:22 PM ritika jain 
wrote:

> Hi ,
>
> Yes, this works; the only difference is that when a single file is ingested, we
> see the ingested one as C:/Users/Dell/Desktop/abc.txt/ - with an UNWANTED
> slash at the end.
>
> *The file spec part should include the file name:* I have tried it that
> way, and I am getting Access Denied. I also checked that all the access is
> granted to the user who is accessing.
>
> On Wed, Aug 11, 2021 at 4:43 PM Karl Wright  wrote:
>
>> The "path" attribute is not meant to include terminal file names, only
>> directories.  I'm surprised that this works at all.  The file spec part
>> should include the file name.
>>
>> Karl
>>
>>
>> On Wed, Aug 11, 2021 at 2:14 AM ritika jain 
>> wrote:
>>
>>> *Dynamic Job *
>>>
>>> {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description","_value_":"DEMo
>>>  TEMP 
>>> API-1628595484"},{"_type_":"repository_connection","_value_":"Demo_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":*"windows\/Job\/Demo
>>>  School 
>>> 

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
Hi ,

Yes, this works; the only difference is that when a single file is ingested, we
see the ingested one as C:/Users/Dell/Desktop/abc.txt/ - with an UNWANTED
slash at the end.

*The file spec part should include the file name:* I have tried it that way;
I am getting Access Denied. I also checked that all the access is granted to
the user who is accessing.

On Wed, Aug 11, 2021 at 4:43 PM Karl Wright  wrote:

> The "path" attribute is not meant to include terminal file names, only
> directories.  I'm surprised that this works at all.  The file spec part
> should include the file name.
>
> Karl
>
>
> On Wed, Aug 11, 2021 at 2:14 AM ritika jain 
> wrote:
>
>> *Dynamic Job *
>>
>> {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description","_value_":"DEMo
>>  TEMP 
>> API-1628595484"},{"_type_":"repository_connection","_value_":"Demo_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":*"windows\/Job\/Demo
>>  School 
>> Network\/Information\/restpuntion.docx"*,"_value_":""},{"_type_":"maxlength","_value_":"","_attribute_value":"200"},{"_type_":"security","_value_":"","_attribute_value":"on"},{"_type_":"sharesecurity","_value_":"","_attribute_value":"on"},{"_type_":"parentfoldersecurity","_value_":"","_attribute_value":"on"}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Tika"},{"_type_":"stage_specification","_children_":[{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"ignoreException","_value_":"","_attribute_value":"true"},{"_type_":"lowerNames","_value_":"","_attribute_value":"false"},{"_type_":"writeLimit","_value_":"","_attribute_value":""},{"_type_":"boilerplateprocessor","_value_":"","_attribute_value":"de.l3s.boilerpipe.extractors.KeepEverythingExtractor"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"1"},{"_type_":"stage_prerequisite","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Metadata
>>  
>> 

Re: Window shares dynamic Job issue

2021-08-11 Thread Karl Wright
The "path" attribute is not meant to include terminal file names, only
directories.  I'm surprised that this works at all.  The file spec part
should include the file name.
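
(For illustration, a minimal Java sketch of the intended split: the
startpoint path names only the containing folder, and the file name lives
in the file-spec pattern. The folder and file names are borrowed from the
job JSON above purely for illustration; this is not ManifoldCF code.)

import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class SpecCheck {
  public static void main(String[] args) {
    // Path attribute: directories only, stopping at the containing folder.
    String startpoint = "windows/Job/Demo School Network/Information";
    // File spec: the pattern that carries the terminal file name.
    PathMatcher spec = FileSystems.getDefault()
        .getPathMatcher("glob:**/restpuntion.docx");
    System.out.println(spec.matches(
        Paths.get(startpoint, "restpuntion.docx")));  // prints: true
  }
}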

Karl


On Wed, Aug 11, 2021 at 2:14 AM ritika jain 
wrote:

> *Dynamic Job *
>
> {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description","_value_":"DEMo
>  TEMP 
> API-1628595484"},{"_type_":"repository_connection","_value_":"Demo_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":*"windows\/Job\/Demo
>  School 
> Network\/Information\/restpuntion.docx"*,"_value_":""},{"_type_":"maxlength","_value_":"","_attribute_value":"200"},{"_type_":"security","_value_":"","_attribute_value":"on"},{"_type_":"sharesecurity","_value_":"","_attribute_value":"on"},{"_type_":"parentfoldersecurity","_value_":"","_attribute_value":"on"}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Tika"},{"_type_":"stage_specification","_children_":[{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"ignoreException","_value_":"","_attribute_value":"true"},{"_type_":"lowerNames","_value_":"","_attribute_value":"false"},{"_type_":"writeLimit","_value_":"","_attribute_value":""},{"_type_":"boilerplateprocessor","_value_":"","_attribute_value":"de.l3s.boilerpipe.extractors.KeepEverythingExtractor"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"1"},{"_type_":"stage_prerequisite","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Metadata
>  
> 

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
*Dynamic Job *

{"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description","_value_":"DEMo
TEMP 
API-1628595484"},{"_type_":"repository_connection","_value_":"Demo_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":*"windows\/Job\/Demo
School 
Network\/Information\/restpuntion.docx"*,"_value_":""},{"_type_":"maxlength","_value_":"","_attribute_value":"200"},{"_type_":"security","_value_":"","_attribute_value":"on"},{"_type_":"sharesecurity","_value_":"","_attribute_value":"on"},{"_type_":"parentfoldersecurity","_value_":"","_attribute_value":"on"}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Tika"},{"_type_":"stage_specification","_children_":[{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"ignoreException","_value_":"","_attribute_value":"true"},{"_type_":"lowerNames","_value_":"","_attribute_value":"false"},{"_type_":"writeLimit","_value_":"","_attribute_value":""},{"_type_":"boilerplateprocessor","_value_":"","_attribute_value":"de.l3s.boilerpipe.extractors.KeepEverythingExtractor"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"1"},{"_type_":"stage_prerequisite","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Metadata

Re: Window shares dynamic Job issue

2021-08-10 Thread Karl Wright
I am sorry, but I'm having trouble understanding how exactly you are
configuring the JCIFS connector in these two cases.Can you view the job
in each case and provide cut-and-paste of the view?

Karl


On Tue, Aug 10, 2021 at 9:09 AM ritika jain 
wrote:

> Hi All,
>
> I am using the Window Shares connector in ManifoldCF 2.14 and Elastic
> as output.
> I have created a dynamic ManifoldCF job via the API, through which a job is
> created in ManifoldCF with an inclusions list and a path; only a particular
> file path is specified. Example file path: C:/Users/Dell/Desktop/abc.txt.
>
> A job is created to crawl only this single file.
> *Issue:*
> When this job ingests the document into Elasticsearch, a slash gets
> appended at the end.
>
> *Ingested file is:* C:/Users/Dell/Desktop/abc.txt/
>
> But when the same file is crawled via the ManifoldCF job settings, with the
> path specified only down to the folder (manual job creation does not allow
> a file path down to a particular file, only down to folders),
> no slash is appended.
>
> *Ingested file in this case:*
> C:/Users/Dell/Desktop/abc.txt
> as expected, the original file.
>
> *Query:*
> Why is this the case? It makes searching in ES ambiguous.
>
> Thanks
> Ritika
>
>
>


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
I had a quick look at Jira. I think there is already a ticket which
covers the requirement of using a sitemap.xml file that is referenced by
robots.txt:

https://issues.apache.org/jira/browse/CONNECTORS-1657

I'll update this ticket with information from the sitemap protocol page:
https://www.sitemaps.org/protocol.html#submit_robots

I'll try to add further tickets requesting the functionality to submit the
sitemap directly to the 'search engine' (in our case this role is played
by the ManifoldCF Web connector).

Sebastian

On 07.07.2021 16:00, Karl Wright wrote:

> If you wish to add a feature request, please create a CONNECTORS ticket that 
> describes the functionality you think the connector should have. 
> 
> Karl 
> 
> On Wed, Jul 7, 2021 at 9:29 AM h0444xk8  wrote: 
> 
>> Hi,
>> 
>> yes, that seems to be the reason. In:
>> 
>> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>>  [1]
>> 
>> there is the following code sequence:
>> 
>> else if (lowercaseLine.startsWith("sitemap:"))
>> {
>>   // We don't complain about this, but right now we don't
>>   // listen to it either.
>> }
>> 
>> But if I have a look at:
>> 
>> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>>  [2]
>> 
>> a sitemap containing an urlset seems to be handled
>> 
>> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>> {
>>   // Sitemap detected
>>   outerTagCount++;
>>   return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
>> }
>> 
>> So, my question is: is there another way to handle sitemaps inside the 
>> Web Crawler?
>> 
>> Cheers Sebastian
>> 
On 07.07.2021 12:23, Karl Wright wrote:
>> 
>>> The robots parsing does not recognize the "sitemaps" line, which was 
>>> likely not in the spec for robots when this connector was written.
>>> 
>>> Karl
>>> 
>>> On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:
>>> 
 Hi,
 
 I have a general question. Is the Web connector supporting sitemap 
 files
 referenced by the robots.txt? In my use case the robots.txt is stored 
 in
 the root of the website and is referencing two compressed sitemaps.
 
 Example of robots.txt
 
 User-Agent: *
 Disallow:
 Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [3] [1]
 Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [4] [2]
 
 When crawling starts, there is an error log entry in „Simple History" as
 follows:
 
 Unknown robots.txt line: 'Sitemap:
 https://www.example.de/sitemap/en-sitemap.xml.gz [4] [2]'
 
 Is there a general problem with sitemaps at all or with sitemaps
 referenced in robots.txt or with compressed sitemaps?
 
 Best regards
 
 Sebastian
>> 
>> Links:
>> --
>> [1] https://www.example.de/sitemap/de-sitemap.xml.gz [3]
>> [2] https://www.example.de/sitemap/en-sitemap.xml.gz [4]
 

Links:
--
[1]
https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
[2]
https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
[3] https://www.example.de/sitemap/de-sitemap.xml.gz
[4] https://www.example.de/sitemap/en-sitemap.xml.gz


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket
that describes the functionality you think the connector should have.

Karl


On Wed, Jul 7, 2021 at 9:29 AM h0444xk8  wrote:

> Hi,
>
> yes, that seems to be the reason. In:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>
> there is the following code sequence:
>
> else if (lowercaseLine.startsWith("sitemap:"))
> {
>   // We don't complain about this, but right now we don't
>   // listen to it either.
> }
>
> But if I have a look at:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>
> a sitemap containing an urlset seems to be handled
>
> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
> {
>   // Sitemap detected
>   outerTagCount++;
>   return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
> }
>
> So, my question is: is there another way to handle sitemaps inside the
> Web Crawler?
>
> Cheers Sebastian
>
>
>
>
>
> On 07.07.2021 12:23, Karl Wright wrote:
>
> > The robots parsing does not recognize the "sitemaps" line, which was
> > likely not in the spec for robots when this connector was written.
> >
> > Karl
> >
> > On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:
> >
> >> Hi,
> >>
> >> I have a general question. Is the Web connector supporting sitemap
> >> files
> >> referenced by the robots.txt? In my use case the robots.txt is stored
> >> in
> >> the root of the website and is referencing two compressed sitemaps.
> >>
> >> Example of robots.txt
> >> 
> >> User-Agent: *
> >> Disallow:
> >> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [1]
> >> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]
> >>
> >> When crawling starts, there is an error log entry in „Simple History" as
> >> follows:
> >>
> >> Unknown robots.txt line: 'Sitemap:
> >> https://www.example.de/sitemap/en-sitemap.xml.gz [2]'
> >>
> >> Is there a general problem with sitemaps at all or with sitemaps
> >> referenced in robots.txt or with compressed sitemaps?
> >>
> >> Best regards
> >>
> >> Sebastian
>
>
> Links:
> --
> [1] https://www.example.de/sitemap/de-sitemap.xml.gz
> [2] https://www.example.de/sitemap/en-sitemap.xml.gz
>


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8

Hi,

yes, that seems to be the reason. In:

https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java

there is the following code sequence:

else if (lowercaseLine.startsWith("sitemap:"))
{
  // We don't complain about this, but right now we don't
  // listen to it either.
}
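
(For illustration only, a hypothetical way a robots.txt scanner could
collect the directive instead of ignoring it; a self-contained sketch,
not the actual ManifoldCF fix.)

import java.util.ArrayList;
import java.util.List;

public class RobotsSitemapScan {
  // Collect the Sitemap directives from the lines of a robots.txt file.
  public static List<String> sitemapUrls(List<String> robotsLines) {
    List<String> urls = new ArrayList<>();
    for (String line : robotsLines) {
      if (line.toLowerCase().startsWith("sitemap:")) {
        String url = line.substring("sitemap:".length()).trim();
        if (!url.isEmpty())
          urls.add(url);  // e.g. https://www.example.de/sitemap/de-sitemap.xml.gz
      }
    }
    return urls;
  }
}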

But if I have a look at:

https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java

a sitemap containing an urlset seems to be handled

else if (localName.equals("urlset") || localName.equals("sitemapindex"))
{
  // Sitemap detected
  outerTagCount++;
  return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
}

So, my question is: is there another way to handle sitemaps inside the 
Web Crawler?


Cheers Sebastian





On 07.07.2021 12:23, Karl Wright wrote:

The robots parsing does not recognize the "sitemaps" line, which was 
likely not in the spec for robots when this connector was written.


Karl

On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:


Hi,

I have a general question. Is the Web connector supporting sitemap 
files
referenced by the robots.txt? In my use case the robots.txt is stored 
in

the root of the website and is referencing two compressed sitemaps.

Example of robots.txt

User-Agent: *
Disallow:
Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [1]
Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]

When crawling starts, there is an error log entry in „Simple History" as
follows:

Unknown robots.txt line: 'Sitemap:
https://www.example.de/sitemap/en-sitemap.xml.gz [2]'

Is there a general problem with sitemaps at all or with sitemaps
referenced in robots.txt or with compressed sitemaps?

Best regards

Sebastian



Links:
--
[1] https://www.example.de/sitemap/de-sitemap.xml.gz
[2] https://www.example.de/sitemap/en-sitemap.xml.gz


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely
not in the spec for robots when this connector was written.

Karl


On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:

> Hi,
>
> I have a general question. Is the Web connector supporting sitemap files
> referenced by the robots.txt? In my use case the robots.txt is stored in
> the root of the website and is referencing two compressed sitemaps.
>
> Example of robots.txt
> 
> User-Agent: *
> Disallow:
> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
>
> When crawling starts, there is an error log entry in „Simple History" as
> follows:
>
> Unknown robots.txt line: 'Sitemap:
> https://www.example.de/sitemap/en-sitemap.xml.gz'
>
> Is there a general problem with sitemaps at all or with sitemaps
> referenced in robots.txt or with compressed sitemaps?
>
> Best regards
>
> Sebastian
>


Re: Manifoldcf Redirection process

2021-05-28 Thread Karl Wright
302 does get recognized as a redirection, yes


On Fri, May 28, 2021 at 5:07 AM ritika jain 
wrote:

> Is the process the same when the fetch/process status code returned is 302,
> when running a job with the web crawler and ES output connector?
>
> Does anybody have a clue about this?
>


Re: Manifoldcf Redirection process

2021-05-28 Thread ritika jain
> Is the process the same when the fetch/process status code returned is 302,
> when running a job with the web crawler and ES output connector?
>
Does anybody have a clue about this?


Re: Manifoldcf Redirection process

2021-05-20 Thread ritika jain
Is the process the same when the fetch/process status code returned is 302,
when running a job with the web crawler and ES output connector?

On Wed, May 19, 2021 at 10:35 PM Karl Wright  wrote:

> ManifoldCF reads all the URLs on its queue.
> If it's a 301, it detects this and pushes the new URL onto the document
> queue.
> When it gets to the new URL, it processes it like any other.
>
> Karl
>
>
> On Wed, May 19, 2021 at 8:32 AM ritika jain 
> wrote:
>
>> Hi
>>
>> I want to understand how ManifoldCF handles redirection of URLs in the
>> case of the web crawler connector.
>>
>> If there is a page redirect (through a 301) to another URL, then the next
>> crawl will detect the redirect and index the new (final) URL and display it
>> in the search results (instead of the old URL that redirects), just as is
>> also done by search engines like Google / Bing.
>>
>> Is it true that ManifoldCF is capable of skipping the URL that returns a
>> 301, picking up the URL to which it redirects, and ingesting that URL?
>>
>> If not, what process does ManifoldCF follow to ingest redirected URLs?
>>
>> Thanks
>> Ritika
>>
>>
>>


Re: Manifoldcf Redirection process

2021-05-19 Thread Karl Wright
ManifoldCF reads all the URLs on its queue.
If it's a 301, it detects this and pushes the new URL onto the document
queue.
When it gets to the new URL, it processes it like any other.
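
(For illustration, a minimal sketch of redirect detection using plain JDK
classes; the URL is hypothetical and this is not ManifoldCF's actual
crawler code.)

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("https://example.com/old-page");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(false);   // see the raw 301/302 ourselves
    int status = conn.getResponseCode();
    if (status == HttpURLConnection.HTTP_MOVED_PERM
        || status == HttpURLConnection.HTTP_MOVED_TEMP) {
      String target = conn.getHeaderField("Location");
      System.out.println("Redirect " + status + " -> " + target);
      // A crawler would push 'target' onto its document queue here.
    }
  }
}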

Karl


On Wed, May 19, 2021 at 8:32 AM ritika jain 
wrote:

> Hi
>
> I want to understand how ManifoldCF handles redirection of URLs in the
> case of the web crawler connector.
>
> If there is a page redirect (through a 301) to another URL, then the next
> crawl will detect the redirect and index the new (final) URL and display it
> in the search results (instead of the old URL that redirects), just as is
> also done by search engines like Google / Bing.
>
> Is it true that ManifoldCF is capable of skipping the URL that returns a
> 301, picking up the URL to which it redirects, and ingesting that URL?
>
> If not, what process does ManifoldCF follow to ingest redirected URLs?
>
> Thanks
> Ritika
>
>
>


Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
"crashing the manifold" is probably running out of memory, and it is
probably due to having too many worker threads and insufficient memory, not
the error you found.

If that error caused a problem, it would simply abort the job, not "crash"
Manifold.

Karl


On Fri, May 14, 2021 at 4:10 AM ritika jain 
wrote:

> It retries 3 times and then usually crashes ManifoldCF.
>
> A similar ticket I observed:
> https://issues.apache.org/jira/browse/CONNECTORS-1633. Is ManifoldCF
> itself capable of skipping the file that causes the issue, instead of
> aborting the job or crashing ManifoldCF?
>
> On Fri, May 14, 2021 at 1:34 PM Karl Wright  wrote:
>
>> 'JCIFS: Possibly transient exception detected on attempt 1 while getting
>> share security'
>>
>> Yes, it is going to retry.
>>
>> Karl
>>
>> On Fri, May 14, 2021 at 1:45 AM ritika jain 
>> wrote:
>>
>>> Hi,
>>> I am using the Windows Shares connector in ManifoldCF 2.14, the
>>> ElasticSearch connector as the output connector, and Tika and the
>>> Metadata adjuster as transformation connectors.
>>>
>>> I am trying to crawl files from an SMB server with 64 GB, and the
>>> ManifoldCF start-options file is given 32 GB of memory.
>>> But many times I got different errors while processing documents:
>>> *1) Access is denied*
>>> *2) ... 23 more*
>>>
>>>
>>> WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service
>>> interruption reported for job 1599130705168 connection 'Themas_Repo':
>>> Timeout or other service interruption: Interrupted while acquiring credits
>>> WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly
>>> transient exception detected on attempt 1 while getting share security:
>>> Interrupted while acquiring credits
>>> jcifs.smb.SmbException: Interrupted while acquiring credits
>>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
>>> [mcf-jcifs-connector.jar:2.14]
>>> at
>>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1261)
>>> [mcf-jcifs-connector.jar:2.14]
>>> at
>>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
>>> [mcf-jcifs-connector.jar:2.14]
>>> at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>> [mcf-pull-agent.jar:?]
>>> Caused by: java.io.InterruptedIOException: Interrupted while acquiring
>>> credits
>>> at
>>> jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:976) ~[?:?]
>>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
>>> ~[?:?]
>>> ... 23 more
>>> Caused by: java.lang.InterruptedException
>>> at
>>> 

Re: Interrupted while acquiring credits

2021-05-14 Thread ritika jain
It retries 3 times and then usually crashes ManifoldCF.

A similar ticket I observed:
https://issues.apache.org/jira/browse/CONNECTORS-1633. Is ManifoldCF itself
capable of skipping the file that causes the issue, instead of aborting the
job or crashing ManifoldCF?

On Fri, May 14, 2021 at 1:34 PM Karl Wright  wrote:

> 'JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security'
>
> Yes, it is going to retry.
>
> Karl
>
> On Fri, May 14, 2021 at 1:45 AM ritika jain 
> wrote:
>
>> Hi,
>> I am using the Windows Shares connector in ManifoldCF 2.14, the
>> ElasticSearch connector as the output connector, and Tika and the
>> Metadata adjuster as transformation connectors.
>>
>> I am trying to crawl files from an SMB server with 64 GB, and the
>> ManifoldCF start-options file is given 32 GB of memory.
>> But many times I got different errors while processing documents:
>> *1) Access is denied*
>> *2) ... 23 more*
>>
>>
>> WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service interruption
>> reported for job 1599130705168 connection 'Themas_Repo': Timeout or other
>> service interruption: Interrupted while acquiring credits
>> WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly transient
>> exception detected on attempt 1 while getting share security: Interrupted
>> while acquiring credits
>> jcifs.smb.SmbException: Interrupted while acquiring credits
>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
>> ~[jcifs-ng-2.1.2.jar:?]
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
>> [mcf-jcifs-connector.jar:2.14]
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1261)
>> [mcf-jcifs-connector.jar:2.14]
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
>> [mcf-jcifs-connector.jar:2.14]
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>> Caused by: java.io.InterruptedIOException: Interrupted while acquiring
>> credits
>> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:976)
>> ~[?:?]
>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
>> ~[?:?]
>> ... 23 more
>> Caused by: java.lang.InterruptedException
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>> ~[?:1.8.0_292]
>> at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
>> ~[?:1.8.0_292]
>> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:959)
>> ~[?:?]
>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
>> ~[?:?]
>> ... 23 more
>>  WARN 2021-05-13T12:33:17,314 (Worker thread 

Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
'JCIFS: Possibly transient exception detected on attempt 1 while getting
share security'

Yes, it is going to retry.

Karl
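
(For illustration, the generic retry-then-give-up shape matching the
observed behavior: "attempt 1", "attempt 2", abort after 3. The exception
type and the fetch method are hypothetical stand-ins, not ManifoldCF code.)

public class RetryExample {
  static class TransientException extends Exception {}

  static String fetchShareSecurity() throws TransientException {
    throw new TransientException();  // pretend the SMB call keeps failing
  }

  static String withRetries() throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return fetchShareSecurity();
      } catch (TransientException e) {
        if (attempt >= 3) throw e;      // abort the job rather than crash
        Thread.sleep(1000L * attempt);  // simple backoff between attempts
      }
    }
  }
}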

On Fri, May 14, 2021 at 1:45 AM ritika jain 
wrote:

> Hi,
> I am using the Windows Shares connector in ManifoldCF 2.14, the
> ElasticSearch connector as the output connector, and Tika and the
> Metadata adjuster as transformation connectors.
>
> I am trying to crawl files from an SMB server with 64 GB, and the
> ManifoldCF start-options file is given 32 GB of memory.
> But many times I got different errors while processing documents:
> *1) Access is denied*
> *2) ... 23 more*
>
>
> WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service interruption
> reported for job 1599130705168 connection 'Themas_Repo': Timeout or other
> service interruption: Interrupted while acquiring credits
> WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly transient
> exception detected on attempt 1 while getting share security: Interrupted
> while acquiring credits
> jcifs.smb.SmbException: Interrupted while acquiring credits
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1261)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
> Caused by: java.io.InterruptedIOException: Interrupted while acquiring
> credits
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:976)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
> Caused by: java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> ~[?:1.8.0_292]
> at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
> ~[?:1.8.0_292]
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:959)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
>  WARN 2021-05-13T12:33:17,314 (Worker thread '2') - JCIFS: Possibly
> transient exception detected on attempt 2 while getting share security:
> Interrupted while acquiring credits
> jcifs.smb.SmbException: Interrupted while acquiring credits
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.
>
> Do we have such functionality that, in case an error like this occurs, it
> skips the particular record and then continues to process
> further instead of 

Re: Notification connector error

2021-05-11 Thread Karl Wright
This used to work fine, but I suspect that when SSL was declared unsafe, it
was disabled, and now only TLS will work.
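
(A possible workaround, an untested sketch: pin the JavaMail session to a
modern TLS version. mail.smtp.ssl.protocols is a standard JavaMail session
property, but whether the mail-1.4.5 jar shipped with the connector honors
it is an assumption to verify.)

import java.util.Properties;
import javax.mail.Session;

public class MailTlsConfig {
  public static Session buildSession() {
    Properties props = new Properties();
    props.put("mail.smtp.host", "smtp.gmail.com");
    props.put("mail.smtp.port", "587");
    props.put("mail.smtp.auth", "true");
    props.put("mail.smtp.starttls.enable", "true");
    // Older JavaMail may try TLSv1, which newer JDKs disable; ask for 1.2.
    props.put("mail.smtp.ssl.protocols", "TLSv1.2");
    return Session.getInstance(props);
  }
}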

Karl


On Tue, May 11, 2021 at 12:13 PM  wrote:

> Hello,
>
>
>
> I am trying to use an email notification connector, but without success.
> When the connector tries to send an email, I keep getting the following error:
>
>
>
> Email: Error sending email: Could not convert socket to TLS
>
> javax.mail.MessagingException: Could not convert socket to TLS
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1918)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:652)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:317)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:176)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:125)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send0(Transport.java:194)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send(Transport.java:124)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailSession.send(EmailSession.java:112)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailConnector$SendThread.run(EmailConnector.java:963)
> ~[?:?]
>
> Caused by: javax.net.ssl.SSLHandshakeException: No appropriate protocol
> (protocol is disabled or cipher suites are inappropriate)
>
> at
> sun.security.ssl.HandshakeContext.(HandshakeContext.java:170) ~[?:?]
>
> at
> sun.security.ssl.ClientHandshakeContext.(ClientHandshakeContext.java:98)
> ~[?:?]
>
> at
> sun.security.ssl.TransportContext.kickstart(TransportContext.java:221)
> ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:433) ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411) ~[?:?]
>
> at
> com.sun.mail.util.SocketFetcher.configureSSLSocket(SocketFetcher.java:548)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.util.SocketFetcher.startTLS(SocketFetcher.java:485)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1913)
> ~[mail-1.4.5.jar:1.4.5]
>
> ... 8 more
>
>
>
>
>
> The connector is configured with a gmail SMTP, using the configuration
> recommended by the documentation:
>
>
>
> Hostname: smtp.gmail.com
>
> Port: 587
>
>
>
> Configuration properties:
>
> mail.smtp.ssl.trust : smtp.gmail.com
>
> mail.smtp.starttls.enable : true
>
> mail.smtp.auth : true
>
>
>
>
>
> The username and password I use are correct and I also tried with the
> office365 SMTP and I get the same error.
>
>
>
> I am using OpenJDK version "11.0.11" 2021-04-20. Do you have any idea
> about my issue?
>
>
>
> Julien
>
>
>


Re: General questions

2021-04-12 Thread Karl Wright
Hi,

There was a book written but never published on ManifoldCF and how to write
connectors.  It's meant to be extended in that way.  The PDFs for the book
are available for free online, and they are linked through the manifoldcf
web site.

Karl


On Mon, Apr 12, 2021 at 8:49 AM koch  wrote:

> Hi everyone,
>
> I would like to know, what is planned for ManifoldCF in the future?
> How much activity is in the project, or is there already an 'end of
> life' in sight?
>
> Is it compatible with Java 11 or higher?
>
> Has someone tried to use it in an OSGi container like Karaf?
>
> How can I extend ManifoldCF? If I would like to write my own repository or
> output connectors, do I have to plug them in at build time, or is it
> possible to add connectors at runtime?
>
> Any help would be much appreciated.
>
> Kind regards,
> Matthias
>
>
>


Re: Manifoldcf Deletion Process

2021-03-30 Thread Karl Wright
Hi Ritika,

There is no deletion process.  Deletion takes place when a job is run in a
mode where deletion is possible (there are some where it is not).  The way
it takes place depends on the kind of repository connector (what model it
declares itself to use).

For the most common kinds of connectors, the job sequence involves scanning
all documents described by the job.  If the document is gone, it is deleted
right away.  If a document just wasn't accessed on the crawl, then at the
end of the job those no-longer-referenced documents are removed.
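
(For illustration, the mark-and-sweep shape of the two deletion paths
described above, a sketch only and not ManifoldCF code: a document found
missing is removed immediately, and documents never reached during the
crawl are swept at the end.)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MarkAndSweep {
  public static void main(String[] args) {
    Set<String> indexed = new HashSet<>(Arrays.asList("a", "b", "c"));
    Set<String> reachedThisCrawl = new HashSet<>();

    for (String url : Arrays.asList("a", "b")) {  // "c" is no longer linked
      boolean exists = !url.equals("b");          // pretend "b" now returns 404
      if (exists) reachedThisCrawl.add(url);
      else indexed.remove(url);                   // gone: deleted right away
    }
    indexed.retainAll(reachedThisCrawl);          // end-of-job sweep removes "c"
    System.out.println(indexed);                  // prints: [a]
  }
}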

Karl


On Tue, Mar 30, 2021 at 9:03 AM ritika jain 
wrote:

> Hi All,
>
> I want to understand the ManifoldCF deletion process, i.e. in which cases
> the deletion process (when checked in Simple History) executes.
> One case, as far as I know, is whenever the seed URL of a
> particular job is changed.
> What are all the cases in which the deletion process runs?
>
> My requirement is to research whether ManifoldCF is capable of handling the
> scenario where a URL exists and has been ingested into the Elastic index
> (say www.abc.com).
>
> The next time the job is run, say the URL www.abc.com does not exist
> anymore and results in a 404. Is ManifoldCF capable of handling (by
> default) this 404 URL and deleting it from the database and from the
> ElasticSearch index (into which it was already ingested)?
>
> Any help will be appreciated.
> Thanks
> Ritika
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-25 Thread Shirai Takashi/ 白井隆
Hi, Karl.

Karl Wright wrote:
>I have now updated (I think) everything that this patch actually has, save
>for one deprecated field substitution (the "types" field is now the "doc_"

I've confirmed the updated sources via git://git.apache.org/manifoldcf.git,
and found a problem in the following code.

connectors/elasticsearch/connector/src/main/java/org/apache/manifoldcf/agents/output/elasticsearch/ElasticSearchIndex.java:
if (useIngesterAttachment && inputStream != null) {
...(Clause1)...
}
if (useMapperAttachments && inputStream != null) {
...(Clause2)...
}
if (!useMapperAttachments && inputStream != null) {
...(Clause3)...
}

In the default case, it executes only Clause3.
If useMapperAttachments is set, it executes only Clause2.
(useMapperAttachments and useIngesterAttachment will never both be set.)
But if useIngesterAttachment is set, it executes both Clause1 and Clause3.
These clauses must be exclusive.
Each of Clause1 and Clause3 provides the contentAttributeName field.
If both of them are executed, this field will be duplicated.

Please fix them as follows.

if (!useIngesterAttachment && !useMapperAttachments && inputStream != null) {
...(Clause3)...
}
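
(Equivalently, spelled out as mutually exclusive branches; clause bodies
elided as in the excerpt above, a sketch of the corrected structure only:)

if (useIngesterAttachment && inputStream != null) {
...(Clause1)...
} else if (useMapperAttachments && inputStream != null) {
...(Clause2)...
} else if (inputStream != null) {
...(Clause3)...
}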


P.S.
Where is "Ingester" from?
The strict name of the plugin is "Ingest Attachment Processor Plugin".
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

P.S.2
The strict name of the product is not "ElasticSearch" but "Elasticsearch".
https://www.elastic.co/what-is/elasticsearch


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-21 Thread Shirai Takashi/ 白井隆
Hi, Karl.

Karl Wright wrote:
>field).  I would like to know more about this.  Does the "types" field no
>longer work?  Should we send both, in order to be sure that the connector
>works with most versions of ElasticSearch?  Please help clarify so that I
>can finish this off.

The "types" field is meaningless in 6.x, and deprecated in 7.x.
Please see the following.
https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html

You shouldn't delete this field for the reason of compatibility.
But the latest Elasticsearch can receive only '_doc',
then the default value should be '_doc'.
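
(For illustration, the shape of the document path the connector would emit;
the index name and ID here are hypothetical.)

public class EsDocPath {
  public static void main(String[] args) {
    String index = "myindex";
    String id = "abc123";
    // ES 6.x and earlier: PUT /myindex/mytype/abc123 (the "types" field chose "mytype")
    // ES 7.x and later:   PUT /myindex/_doc/abc123   (only "_doc" is expected)
    System.out.println("/" + index + "/_doc/" + id);
  }
}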


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
I have now updated (I think) everything that this patch actually has, save
for one deprecated field substitution (the "types" field is now the "_doc"
field).  I would like to know more about this.  Does the "types" field no
longer work?  Should we send both, in order to be sure that the connector
works with most versions of ElasticSearch?  Please help clarify so that I
can finish this off.

The changes are committed to trunk; I would be very appreciative if Shirai
Takashi/ 白井隆 reviewed them there.  Thanks!

Karl


On Sat, Mar 20, 2021 at 4:32 AM Karl Wright  wrote:

> Hi,
>
> Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .
>
> I did not commit the patches as given because I felt that the fix was a
> relatively narrow one and it could be implemented with no user
> involvement.  Adding control for the user was therefore beyond the scope of
> the repair.
>
> There are more changes in these patches than just the ID length issue.  I
> am working to add this functionality as well but without anything I would
> consider to be unneeded.
> Karl
>
>
> On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:
>
>> Thanks for the information.  I'll see what I can do.
>> Karl
>>
>>
>> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 <
>> shi...@nintendo.co.jp> wrote:
>>
>>> Hi, Karl.
>>>
>>> Karl Wright wrote:
>>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>>> is
>>> >the only way I believe we're allowed to accept it legally.
>>>
>>> Are you asking me to send the patch to the JIRA ticket?
>>> I can't access JIRA because of our firewall.
>>> Sorry.
>>> What can I do without JIRA?
>>>
>>> 
>>> Nintendo, Co., Ltd.
>>> Product Technology Dept.
>>> Takashi SHIRAI
>>> PHONE: +81-75-662-9600
>>> mailto:shi...@nintendo.co.jp
>>>
>>


Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
Hi,

Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .

I did not commit the patches as given because I felt that the fix was a
relatively narrow one and it could be implemented with no user
involvement.  Adding control for the user was therefore beyond the scope of
the repair.

There are more changes in these patches than just the ID length issue.  I
am working to add this functionality as well but without anything I would
consider to be unneeded.
Karl


On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:

> Thanks for the information.  I'll see what I can do.
> Karl
>
>
> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
> wrote:
>
>> Hi, Karl.
>>
>> Karl Wright wrote:
>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>> is
>> >the only way I believe we're allowed to accept it legally.
>>
>> Are you asking me to send the patch to the JIRA ticket?
>> I can't access JIRA because of our firewall.
>> Sorry.
>> What can I do without JIRA?
>>
>> 
>> Nintendo, Co., Ltd.
>> Product Technology Dept.
>> Takashi SHIRAI
>> PHONE: +81-75-662-9600
>> mailto:shi...@nintendo.co.jp
>>
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-19 Thread Karl Wright
Thanks for the information.  I'll see what I can do.
Karl


On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >Hi - I'm still waiting for this patch to be attached to a ticket.  That is
> >the only way I believe we're allowed to accept it legally.
>
> Are you asking me to send the patch to the JIRA ticket?
> I can't access JIRA because of our firewall.
> Sorry.
> What can I do without JIRA?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Shirai Takashi/ 白井隆
Hi, Karl.

Karl Wright wrote:
>Hi - I'm still waiting for this patch to be attached to a ticket.  That is
>the only way I believe we're allowed to accept it legally.

Are you asking me to send the patch to the JIRA ticket?
I can't access JIRA because of our firewall.
Sorry.
What can I do without JIRA?


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Karl Wright
Hi - I'm still waiting for this patch to be attached to a ticket.  That is
the only way I believe we're allowed to accept it legally.

Karl


On Thu, Mar 4, 2021 at 7:16 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >I agree it is unlikely that the JDK will lose support for SHA-1 because it
> >is used commonly, as is MD5.  So please feel free to use it.
>
> I know.
> I think that SHA-1 is better on the whole.
> I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.
>
> SHA-256 is surely safer from the risk of collision.
> But the risk with SHA-1 can be ignored unless there is a deliberate attack.
> It needs to be considered only when ManifoldCF is used for worldwide
> data.
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Shirai Takashi/ 白井隆
Hi, Karl.

Karl Wright wrote:
>I agree it is unlikely that the JDK will lose support for SHA-1 because it
>is used commonly, as is MD5.  So please feel free to use it.

I know.
I think that SHA-1 is better on the whole.
I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.

SHA-256 is surely safer from the risk of collision.
But the risk with SHA-1 can be ignored unless there is a deliberate attack.
It needs to be considered only when ManifoldCF is used for worldwide data.


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Karl Wright
I agree it is unlikely that the JDK will lose support for SHA-1 because it
is used commonly, as is MD5.  So please feel free to use it.

Karl


On Wed, Mar 3, 2021 at 7:54 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Jörn.
>
> Jörn Franke wrote:
> >Makes sense
>
> I don't think that it's easy.
>
>
> >>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when
> it will be removed from JDK.
>
> I also know SHA-1 is dangerous.
> Someone can generate a string that hashes to the same SHA-1 value in
> order to impersonate another one.
> So SHA-1 should not be used with certificates.
> A future JDK may stop using SHA-1 with certificates.
> But the JDK will never stop supporting the SHA-1 algorithm.
>
> If SHA-1 were removed from the JDK,
> ManifoldCF could not be built, because SHA-1 is used elsewhere as well.
> Some connectors already use SHA-1 as the ID value,
> so the previously saved records would become inaccessible.
> I can use SHA-256 with the Elasticsearch connector.
> How should the other uses of SHA-1 be managed?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-03 Thread Shirai Takashi/ 白井隆
Hi, There.

Shirai Takashi/ 白井隆 wrote:
>I can use SHA-256 with Elasticsearch connector.

I've prepared a patch to support SHA-256.
It minimizes the changes, to avoid global effects.
But it seems inelegant to include the try-catch clause.

I can't decide which is better.
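
(For illustration, a minimal sketch of the "Hash if long" option with
SHA-256, including the try-catch the API forces; a sketch only, not the
patch itself.)

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class IdHasher {
  private static final int MAX_ID_BYTES = 512;  // Elasticsearch's _id limit

  // Keep the raw URI as the ID when it fits; otherwise use a hex SHA-256 digest.
  public static String toId(String uri) {
    byte[] raw = uri.getBytes(StandardCharsets.UTF_8);
    if (raw.length <= MAX_ID_BYTES)
      return uri;
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(raw))
        sb.append(String.format("%02x", b & 0xff));
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every JDK ships SHA-256, but the API forces this checked exception:
      // the "inelegant" try-catch mentioned above.
      throw new IllegalStateException(e);
    }
  }
}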


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp

apache-manifoldcf-elastic-id-2.patch.gz
Description: GNU Zip compressed data


Re: Another Elasticsearch patch to allow the long URI

2021-03-03 Thread Shirai Takashi/ 白井隆
Hi, Jörn.

Jörn Franke wrote:
>Makes sense

I don't think that it's easy.


>>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it 
>>> will be removed from JDK.

I also know SHA-1 is dangerous.
Someone can generate a string that hashes to the same SHA-1 value in order
to impersonate another one.
So SHA-1 should not be used with certificates.
A future JDK may stop using SHA-1 with certificates.
But the JDK will never stop supporting the SHA-1 algorithm.

If SHA-1 were removed from the JDK,
ManifoldCF could not be built, because SHA-1 is used elsewhere as well.
Some connectors already use SHA-1 as the ID value,
so the previously saved records would become inaccessible.
I can use SHA-256 with the Elasticsearch connector.
How should the other uses of SHA-1 be managed?


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Shirai Takashi/ 白井隆
Hi, Karl.

Karl Wright wrote:
>Backwards compatibility means that we very likely have to
>use the hash approach, and not use the decoding approach.

Do you object to the decoding?

It may be useless for users of alphabetical languages.
But it's useful for users of multibyte languages such as CJK.
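
(For illustration, a runnable demonstration of the double encoding
described in the patch mail, using only JDK classes:)

import java.net.URLDecoder;
import java.net.URLEncoder;

public class DoubleEncodingDemo {
  public static void main(String[] args) throws Exception {
    // U+3000 (ideographic space) as it appears in a document URI:
    String once = URLEncoder.encode("\u3000", "UTF-8");    // %E3%80%80
    // Re-encoding the already-encoded URI, as happens when the encoded
    // string is used as the ID and encoded again:
    String twice = URLEncoder.encode(once, "UTF-8");       // %25E3%2580%2580
    // Decoding first keeps the ID short for multibyte (e.g. CJK) URIs:
    String decoded = URLDecoder.decode(once, "UTF-8");
    System.out.println(once + " / " + twice + " / length " + decoded.length());
  }
}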


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Karl Wright
Hi - this is very helpful.  I would like you to officially create a ticket
in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach
these patches.  Backwards compatibility means that we very likely have to
use the hash approach, and not use the decoding approach.

Thanks,
Karl


On Mon, Mar 1, 2021 at 10:10 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, there.
>
> I've found another problem in the Elasticsearch connector.
> The Elasticsearch output connector uses the URI string as the ID.
> Elasticsearch allows an ID length of no more than 512 bytes.
> If the URI is too long, it causes an HTTP 400 error.
>
> I provide two solutions with this attached patch.
> The first is URI decoding.
> If the URI includes multibyte characters,
> the ID is URL-encoded twice.
> Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
> This enlarges the ID length unnecessarily.
> So I add an option to decode the URI before it is encoded as the ID.
>
> But the length may still be longer than 512 bytes.
> The other solution is hashing.
> The newly added options are the following.
> Raw) uses the URI string as is.
> Hash) always hashes (SHA-1) the URI string.
> Hash if long) hashes the URI only if its length exceeds 512 bytes.
> The last one is provided for compatibility.
>
> Both solutions cause a new problem.
> If the URI is decoded or hashed,
> the original URI cannot be kept in each document.
> So I add new fields.
> URI field name) keeps the original URI string as is.
> Decoded URI field name) keeps the decoded URI string.
> The default settings leave these fields empty.
>
>
> I sent the patch for Ingest-Attachment the other day.
> So this mail attaches the two patches.
> apache-manifoldcf-2.18-elastic-id.patch.gz:
>  The patch for 2.18, including the patch of the other day.
> apache-manifoldcf-elastic-id.patch.gz:
>  The patch for the source already patched the other day.
>
> By the way, I tried to describe the above in some documents.
> But no suitable document was found in the ManifoldCF package.
> The Elasticsearch documentation may have been written for ancient
> specifications.
> Where can I describe these new specifications?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
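
A minimal sketch of the "Hash if long" idea described in the quoted message,
assuming the 512-byte limit applies to the UTF-8 byte length of the ID. The
helper is hypothetical, not the actual patch; the patch hashes with SHA-1,
while SHA-256 is shown here per the later discussion in this thread.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ElasticIdSketch {
  static final int MAX_ID_BYTES = 512; // Elasticsearch's _id length limit

  // "Hash if long": keep the raw URI unless its UTF-8 form exceeds the limit.
  static String toDocumentId(String uri) throws NoSuchAlgorithmException {
    byte[] raw = uri.getBytes(StandardCharsets.UTF_8);
    if (raw.length <= MAX_ID_BYTES) {
      return uri; // "Raw" behavior, preserving backwards compatibility
    }
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(raw)) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString(); // fixed-length hash for over-long URIs
  }
}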


Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Jörn Franke
Makes sense

> On 02.03.2021 at 08:33, Shirai Takashi/ 白井隆 wrote:
> 
> Hi, Jörn.
> 
> Jörn Franke wrote:
>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it 
>> will be removed from JDK.
> 
> SHA-1 is used in an existing ManifoldCF class.
> (org.apache.manifoldcf.core.system.ManifoldCF)
> If "SHA" were replaced with "SHA-256" in this class,
> the default algorithm would be updated everywhere.
> I just followed the ManifoldCF standard.
> I also think SHA-256 or later is better.
> 
> Why does the current ManifoldCF use SHA-1?
> Depending on the reason, this case may have to keep using SHA-1.
> If the reason is only compatibility,
> I can redesign the method ManifoldCF.hash()
> to add an argument that indicates the algorithm.
> 
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp


Re: Another Elasticsearch patch to allow the long URI

2021-03-01 Thread Shirai Takashi/ 白井隆
Hi, Jörn.

Jörn Franke wrote:
>Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will 
>be removed from JDK.

SHA-1 is used in an existing ManifoldCF class.
(org.apache.manifoldcf.core.system.ManifoldCF)
If "SHA" were replaced with "SHA-256" in this class,
the default algorithm would be updated everywhere.
I just followed the ManifoldCF standard.
I also think SHA-256 or later is better.

Why does the current ManifoldCF use SHA-1?
Depending on the reason, this case may have to keep using SHA-1.
If the reason is only compatibility,
I can redesign the method ManifoldCF.hash()
to add an argument that indicates the algorithm.


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp
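
A sketch of what the proposed overload might look like, assuming a hypothetical
signature (the real ManifoldCF.hash() may differ; the legacy default is kept
here only to illustrate the compatibility argument):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashCompatSketch {
  // Existing callers keep the legacy default algorithm.
  static byte[] hash(String input) throws NoSuchAlgorithmException {
    return hash(input, "SHA-1");
  }

  // New overload: the caller chooses the algorithm, e.g. "SHA-256".
  static byte[] hash(String input, String algorithm) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance(algorithm)
        .digest(input.getBytes(StandardCharsets.UTF_8));
  }
}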


Re: Another Elasticsearch patch to allow the long URI

2021-03-01 Thread Jörn Franke
Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will 
be removed from JDK.

> On 02.03.2021 at 04:10, Shirai Takashi/ 白井隆 wrote:
> 
> Hi, there.
> 
> I've found another problem in the Elasticsearch connector.
> The Elasticsearch output connector uses the URI string as the document ID,
> but Elasticsearch limits the ID length to 512 bytes.
> If the URI is too long, this causes an HTTP 400 error.
> 
> I propose two solutions in the attached patch.
> The first is URI decoding.
> If the URI includes multibyte characters,
> the ID ends up URL-encoded twice.
> Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
> This enlarges the ID unnecessarily,
> so I added an option to decode the URI before it is used as the ID.
> 
> But the result may still be longer than 512 bytes.
> The second solution is hashing.
> The newly added options are the following.
> Raw) uses the URI string as is.
> Hash) always hashes (SHA-1) the URI string.
> Hash if long) hashes the URI only if its length exceeds 512 bytes.
> The last one is provided for compatibility.
> 
> Both solutions cause a new problem:
> if the URI is decoded or hashed,
> the original URI can no longer be kept in each document.
> So I added new fields.
> URI field name) keeps the original URI string as is.
> Decoded URI field name) keeps the decoded URI string.
> By default these fields are left empty.
> 
> 
> I sent the patch for Ingest-Attachment the other day,
> so this mail attaches two patches.
> apache-manifoldcf-2.18-elastic-id.patch.gz:
> The patch for 2.18, including the patch from the other day.
> apache-manifoldcf-elastic-id.patch.gz:
> The patch against the source already patched the other day.
> 
> By the way, I tried to describe the above in the documentation,
> but no suitable document exists in the ManifoldCF package.
> The Elasticsearch documentation may have been written for an ancient specification.
> Where can I describe these new specifications?
> 
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
> 
> 
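
The double-encoding effect quoted above can be reproduced with the standard
library; a small demonstration follows (illustration only, not the connector's
actual decode logic; requires Java 10+ for the Charset overloads):

import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
  public static void main(String[] args) {
    String ideographicSpace = "\u3000";
    // First encoding: U+3000 -> %E3%80%80 (9 characters)
    String once = URLEncoder.encode(ideographicSpace, StandardCharsets.UTF_8);
    // Encoding the already-encoded form again: %E3%80%80 -> %25E3%2580%2580 (15 characters)
    String twice = URLEncoder.encode(once, StandardCharsets.UTF_8);
    System.out.println(once);   // %E3%80%80
    System.out.println(twice);  // %25E3%2580%2580
    // Decoding once before using the string as an ID undoes the duplication.
    System.out.println(URLDecoder.decode(twice, StandardCharsets.UTF_8)); // %E3%80%80
  }
}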


Re: Patch contribution to support Ingest-Attachment for Elasticsearch

2021-02-25 Thread Shirai Takashi/ 白井隆
Hi, there.

Shirai Takashi wrote:
>ManifoldCF can use the mapper-attachments plugin with the Elasticsearch connector.
>But it is obsolete; the ingest-attachment plugin is recommended instead.
>I try to support this plugin with the attached patch.

Sorry, I made a mistake in this patch.
Please replace it with the one attached to this mail.

I made an error while copying some text,
so the patch outputs an extra '}' in the JSON code.
Elasticsearch treats a missing '}' as an error
but allows a surplus '}',
so the previous patch causes no error; it is still not correct, though.


Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shi...@nintendo.co.jp

apache-manifoldcf-2.18.patch.gz
Description: GNU Zip compressed data
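
For reference, the kind of imbalance described above looks like this (field
names are made up for illustration; the claim that Elasticsearch tolerates a
trailing surplus brace is from the message above):

{ "pipeline" : { "field" : "attachment" } } }   <- surplus '}'; reported to cause no error
{ "pipeline" : { "field" : "attachment" }       <- missing '}'; rejected by Elasticsearch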


Re: Multiprocess file installation of manifold

2021-02-17 Thread Karl Wright
File synchronization is still supported but is deprecated.  We recommend
ZooKeeper synchronization unless you have a very good reason not to.

Karl
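
For anyone setting this up, the ZooKeeper-related properties.xml entries look
roughly like the following (property names quoted from memory of the
multi-process deployment docs; verify against the how-to-build-and-deploy
documentation before use):

<!-- Hypothetical values; confirm property names and values in the official docs. -->
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>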


On Wed, Feb 17, 2021 at 12:26 PM Ananth Peddinti  wrote:

> Hello Team ,
>
>
> I would like to know if someone has already done a multi-process model
> installation of ManifoldCF on a Linux machine. I would like to know the
> process in detail. We are running into issues with the quick-start model.
>
>
>
> Regards
>
> Ananth


Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika is not memory bounded; some transformations stream, but
others put everything into memory.

You can try using the external Tika, with a Tika instance you run
separately, and that would likely help.  But you may need to give it lots
of memory too.

Karl


On Wed, Feb 17, 2021 at 3:50 AM ritika jain 
wrote:

> Hi Karl,
>
> I am using Elasticsearch as the output connector and yes, using the internal
> Tika extractor, not a Solr output connection.
>
> Also, the Elasticsearch server is hosted on a different server with a huge
> memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright  wrote:
>
>> Hi, do you mean content limiter length of 100?
>>
>> I assume you are using the internal Tika transformer?  Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I described your setup above, because if you do
>> not use the Solr extract handler, the entire content of every document must
>> be loaded into memory.  That is why we require you to fill in a Solr field
>> on those kinds of output connections that limits the number of bytes.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
>> wrote:
>>
>>>
>>>
>>> Hi users
>>>
>>>
>>> I am using the ManifoldCF 2.14 file share connector to crawl files from an SMB
>>> server which has some millions or billions of records to process and
>>> crawl.
>>>
>>> Total system memory is 64 GB, of which 32 GB is allocated in the ManifoldCF
>>> start-options file.
>>>
>>> We have some larger files to crawl, around 30 MB or more than that.
>>>
>>> When the size given in the Content Limiter tab is 10, that is 1 MB, the
>>> job works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF
>>> crashes, with some logs showing a long-running query.
>>>
>>> How can we tune the job specifications to also process large
>>> documents?
>>>
>>> Do I need to increase or decrease the number of connections, or the
>>> worker thread count, or something else?
>>>
>>> Can anybody help me with this, to crawl larger files too, at least up to 10 MB?
>>>
>>> Thanks
>>>
>>> Ritika
>>>
>>


Re: Job Content Length issue

2021-02-17 Thread ritika jain
Hi Karl,

I am using Elasticsearch as the output connector and yes, using the internal
Tika extractor, not a Solr output connection.

Also, the Elasticsearch server is hosted on a different server with a huge
memory allocation.

On Tue, Feb 16, 2021 at 7:29 PM Karl Wright  wrote:

> Hi, do you mean content limiter length of 100?
>
> I assume you are using the internal Tika transformer?  Are you combining
> this with a Solr output connection that is not using the extract handler?
>
> By "manifold crashes" I assume you actually mean it runs out of memory.
> The "long running query" concern is a red herring because that does not
> cause a crash under any circumstances.
>
> This is quite likely if I described your setup above, because if you do
> not use the Solr extract handler, the entire content of every document must
> be loaded into memory.  That is why we require you to fill in a Solr field
> on those kinds of output connections that limits the number of bytes.
>
> Karl
>
>
> On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
> wrote:
>
>>
>>
>> Hi users
>>
>>
>> I am using the ManifoldCF 2.14 file share connector to crawl files from an SMB
>> server which has some millions or billions of records to process and
>> crawl.
>>
>> Total system memory is 64 GB, of which 32 GB is allocated in the ManifoldCF
>> start-options file.
>>
>> We have some larger files to crawl, around 30 MB or more than that.
>>
>> When the size given in the Content Limiter tab is 10, that is 1 MB, the
>> job works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF
>> crashes, with some logs showing a long-running query.
>>
>> How can we tune the job specifications to also process large
>> documents?
>>
>> Do I need to increase or decrease the number of connections, or the
>> worker thread count, or something else?
>>
>> Can anybody help me with this, to crawl larger files too, at least up to 10 MB?
>>
>> Thanks
>>
>> Ritika
>>
>


Re: Job Content Length issue

2021-02-16 Thread Karl Wright
Hi, do you mean content limiter length of 100?

I assume you are using the internal Tika transformer?  Are you combining
this with a Solr output connection that is not using the extract handler?

By "manifold crashes" I assume you actually mean it runs out of memory.
The "long running query" concern is a red herring because that does not
cause a crash under any circumstances.

This is quite likely if I described your setup above, because if you do not
use the Solr extract handler, the entire content of every document must be
loaded into memory.  That is why we require you to fill in a Solr field on
those kinds of output connections that limits the number of bytes.

Karl


On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
wrote:

>
>
> Hi users
>
>
> I am using the ManifoldCF 2.14 file share connector to crawl files from an SMB
> server which has some millions or billions of records to process and
> crawl.
>
> Total system memory is 64 GB, of which 32 GB is allocated in the ManifoldCF
> start-options file.
>
> We have some larger files to crawl, around 30 MB or more than that.
>
> When the size given in the Content Limiter tab is 10, that is 1 MB, the
> job works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF
> crashes, with some logs showing a long-running query.
>
> How can we tune the job specifications to also process large
> documents?
>
> Do I need to increase or decrease the number of connections, or the
> worker thread count, or something else?
>
> Can anybody help me with this, to crawl larger files too, at least up to 10 MB?
>
> Thanks
>
> Ritika
>


Re: content length tab

2021-02-15 Thread Karl Wright
This parameter is in bytes.

Karl


On Mon, Feb 15, 2021 at 9:03 AM ritika jain 
wrote:

> Hi Users,
>
> Can anybody tell me whether this field is specified in bytes or kilobytes.
>
> The "Content Length tab looks like this:
>
>
> [image: Windows Share Job, Content Length tab]
>
> If the value filled in is 100, will this be 100 bytes, 100 kilobytes,
> or 100 MB?
>
> Thanks
> Ritika
>


Re: Job status stuck in terminating

2021-01-07 Thread Karl Wright
Hi,

Usually the reason a job doesn't complete is that a document is retrying
indefinitely.  You can see what's going on by looking at the Simple History
job report, or, if you prefer, by tailing the ManifoldCF log.

Other times a job won't complete because somebody shut down the agents
process.  But that is not the answer for the simple example single-process
deployment.

Karl


On Thu, Jan 7, 2021 at 5:52 PM Isaac Kunz  wrote:

> I have a job that has been stuck in terminating for 12 hrs. It is a small test
> job and I am wondering if there is a way to fix this? The job ran once and
> completed 175k documents. I modified the query for the job and reseeded. The
> job was modified to process a smaller document set. I assume reseeding will
> allow the same documents to be indexed. I do not need the metadata for this
> job, so if needed I could clear it via the DB if I knew how. I am a new user.
>
> Thanks,
>
> Isaac
>


RE: Job status stuck in terminating

2021-01-07 Thread Isaac Kunz

I have a job that has been stuck in terminating for 12 hrs. It is a small test job
and I am wondering if there is a way to fix this? The job ran once and
completed 175k documents. I modified the query for the job and reseeded. The job
was modified to process a smaller document set. I assume reseeding will allow
the same documents to be indexed. I do not need the metadata for this job, so if
needed I could clear it via the DB if I knew how. I am a new user.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM      TIME+ COMMAND
29192 root      20   0 11.991g 4.206g  29136 S   0.7 13.5  101:02.08 /usr/bin/java -Xms4096M -Xmx8192M -jar /opt/manifold/apache-manifoldcf-2.17/example/start.jar
  795 root      20   0  305112   6200   4808 S   0.3  0.0   75:56.82 /usr/bin/vmtoolsd
32652 postgres  20   0  237348  30456  27420 S   0.3  0.1    0:11.41 postgres: manifoldcf dbname 127.0.0.1(60486) idle

The Java process running ManifoldCF is also frozen. Any suggestions for recovery?



Thanks,
Isaac





Re: Indexation Not OK

2021-01-01 Thread Karl Wright
Hi,
I don't have the ability to delete mail from mailing lists.  You have to
request that Apache Infra do that.

Karl


On Thu, Dec 31, 2020 at 11:38 AM Michael Cizmar 
wrote:

> Ritika – We have had some discussions regarding Docker etc.  The
> public image that is out there builds a single node and does not use an
> RDBMS.  I would not recommend using that to index billions of documents.
> You can turn on debugging in the connector and look at the logs to see if
> that traffic is actually going to Elasticsearch.
>
>
>
> Karl – I believe Ritika said Elastic.
>
>
>
>
>
> --
>
> Michael Cizmar
>
>
>
> *From:* ritika jain 
> *Sent:* Thursday, December 31, 2020 7:33 AM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Indexation Not OK
>
>
>
> The Elasticsearch output connector, with some custom changes for some fields.
>
> On Thursday, December 31, 2020, Karl Wright  wrote:
>
> Hi,
> Can you let us know what you are using for the output connector?
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
> wrote:
>
> Hi,
>
>
>
> I am using ManifoldCF 2.14 and the JCIFS connector to ingest some billions of
> records into Elasticsearch.
>
> I am facing an issue in which, when the job has run for some time, successful
> indexation happens at first, but after a while ManifoldCF loops on the records
> and indexation stops returning OK.
>
> It keeps on retrying those specific records; to start it up again, I need to
> restart the Docker container every time, and after the restart indexation
> works fine for those records too.
>
> I also checked that the JSON formed by the Elasticsearch connector is fine,
> which ensures that the files themselves are not the problem.
>
> Can anybody please tell me the reason for this?
>
>
>
> Thanks
>
> Ritika
>
>
>
>
>
>


RE: Indexation Not OK

2020-12-31 Thread Michael Cizmar
Ritika – We have had some discussions regarding Docker etc.  The public image
that is out there builds a single node and does not use an RDBMS.  I would not
recommend using that to index billions of documents.  You can turn on debugging
in the connector and look at the logs to see if that traffic is actually going
to Elasticsearch.

Karl – I believe Ritika said Elastic.


--
Michael Cizmar


From: ritika jain 
Sent: Thursday, December 31, 2020 7:33 AM
To: user@manifoldcf.apache.org
Subject: Re: Indexation Not OK

The Elasticsearch output connector, with some custom changes for some fields.

On Thursday, December 31, 2020, Karl Wright 
mailto:daddy...@gmail.com>> wrote:
Hi,
Can you let us know what you are using for the output connector?
Thanks,
Karl


On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
mailto:ritikajain5...@gmail.com>> wrote:
Hi,

I am using ManifoldCF 2.14 and the JCIFS connector to ingest some billions of
records into Elasticsearch.
I am facing an issue in which, when the job has run for some time, successful
indexation happens at first, but after a while ManifoldCF loops on the records
and indexation stops returning OK.

[image: image.png]

It keeps on retrying those specific records; to start it up again, I need to
restart the Docker container every time, and after the restart indexation
works fine for those records too.
I also checked that the JSON formed by the Elasticsearch connector is fine,
which ensures that the files themselves are not the problem.
Can anybody please tell me the reason for this?

Thanks
Ritika




Re: Indexation Not OK

2020-12-31 Thread Karl Wright
Sorry, I couldn't quite understand everything in your email, but it sounds
like the problem is in the ES connection.  It is possible that ES expires
your connection and the indexing fails after that happens.  If that is
happening, however, I would expect to see a much more detailed error
message in both the logs and in the simple history.  Can you provide any
error messages from the log that seem to be coming from the output
connection?

Thanks,
Karl


On Thu, Dec 31, 2020 at 8:30 AM Karl Wright  wrote:

> Hi,
> Can you let us know what you are using for the output connector?
> Thanks,
> Karl
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
> wrote:
>
>> Hi,
>>
>> I am using ManifoldCF 2.14 and the JCIFS connector to ingest some billions
>> of records into Elasticsearch.
>> I am facing an issue in which, when the job has run for some time, successful
>> indexation happens at first, but after a while ManifoldCF loops on the
>> records and indexation stops returning OK.
>>
>> [image: image.png]
>>
>> It keeps on retrying those specific records; to start it up again, I need to
>> restart the Docker container every time, and after the restart indexation
>> works fine for those records too.
>> I also checked that the JSON formed by the Elasticsearch connector is fine,
>> which ensures that the files themselves are not the problem.
>> Can anybody please tell me the reason for this?
>>
>> Thanks
>> Ritika
>>
>>
>>


Re: Indexation Not OK

2020-12-31 Thread ritika jain
The Elasticsearch output connector, with some custom changes for some fields.

On Thursday, December 31, 2020, Karl Wright  wrote:

> Hi,
> Can you let us know what you are using for the output connector?
> Thanks,
> Karl
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
> wrote:
>
>> Hi,
>>
>> I am using ManifoldCF 2.14 and the JCIFS connector to ingest some billions
>> of records into Elasticsearch.
>> I am facing an issue in which, when the job has run for some time, successful
>> indexation happens at first, but after a while ManifoldCF loops on the
>> records and indexation stops returning OK.
>>
>> [image: image.png]
>>
>> It keeps on retrying those specific records; to start it up again, I need to
>> restart the Docker container every time, and after the restart indexation
>> works fine for those records too.
>> I also checked that the JSON formed by the Elasticsearch connector is fine,
>> which ensures that the files themselves are not the problem.
>> Can anybody please tell me the reason for this?
>>
>> Thanks
>> Ritika
>>
>>
>>

