Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
No, only the seed URLs get updated with that option.


On Tue, Sep 26, 2023 at 10:09 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> Thanks a lot for the explanation, Karl, really useful.
>
> I will wait for your reply at the end of the week, but I thought that the
> main reason for the "Reset seeding" option was exactly that: re-evaluating
> all pages, as in a fresh run.
>
>
> On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:
>
>> Okay, that is good to know.
>> The hopcount assessment occurs when documents are added to the queue.
>> Hopcounts are stored for each document in the hopcount table.  So if you
>> change a hopcount limit, it is quite possible that nothing will change
>> unless documents that are at the previous hopcount limit are re-evaluated.
>> I believe there is no logic in ManifoldCF for that at this time, but I'd
>> have to review the codebase to be certain of that.
>>
>> What that means is that you can't increase the hopcount limit and expect
>> the next crawl to pick up the documents you excluded before with the
>> hopcount mechanism.  Only when the documents need to be rescanned for some
>> other reason would that happen as it stands now.  But I will get back to
>> you after a review at the end of the week.
>>
>> Karl
>>
>>
>> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>> No, I haven't used that option; I have it configured as "Keep
>>> unreachable documents, for now". But is it also ignoring them because
>>> they were already kept? With this option, when are the documents that
>>> are unreachable "for now" converted to "forever"?
>>>
>>> The only solution I can think of is creating a new job with the exact
>>> same characteristics and running it.
>>>
>>> Regards and thanks
>>>Marisol
>>>
>>>
>>>
>>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>>
>>>> If you ever set "Ignore unreachable documents forever" for the job, you
>>>> can't go back and stop ignoring them.  The data that the job would need to
>>>> have recorded for this is gone.  The only way to get it back is if you can
>>>> convince ManifoldCF to recrawl all the documents in the job.
>>>>
>>>>
>>>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>>>> marisol.redondo.gar...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi, I have a problem with documents out of scope.
>>>>>
>>>>> I changed the maximum hop count for type "redirect" in one of my jobs to
>>>>> 5, and saw that the job was not processing some pages because of that, so
>>>>> I removed the value to get them injected into the output connector (the
>>>>> Solr connector).
>>>>> After that, the same pages are still out of scope, as if the limit were
>>>>> set to 1, and they are not indexed.
>>>>>
>>>>> I have tried "Reset seeding", thinking that maybe the pages needed to be
>>>>> checked again, but I still have the same problem. I don't think the
>>>>> problem is with the output, but I have also used the options "Re-index
>>>>> all associated documents" and "Remove all associated records", with the
>>>>> same result.
>>>>> I don't want to clear the history in the repository (it's a website
>>>>> connector), as I don't want to lose all the history.
>>>>>
>>>>> Is this a bug in ManifoldCF? Is there any option to fix this issue?
>>>>>
>>>>> I'm using Manifold version 2.24.
>>>>>
>>>>> Thanks
>>>>> Marisol
>>>>>
>>>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
Okay, that is good to know.
The hopcount assessment occurs when documents are added to the queue.
Hopcounts are stored for each document in the hopcount table.  So if you
change a hopcount limit, it is quite possible that nothing will change
unless documents that are at the previous hopcount limit are re-evaluated.
I believe there is no logic in ManifoldCF for that at this time, but I'd
have to review the codebase to be certain of that.

What that means is that you can't increase the hopcount limit and expect
the next crawl to pick up the documents you excluded before with the
hopcount mechanism.  Only when the documents need to be rescanned for some
other reason would that happen as it stands now.  But I will get back to
you after a review at the end of the week.

Karl
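For anyone who wants to verify this on their own instance, the recorded per-document hopcounts can be inspected directly in the database. This is only a sketch: the `hopcount` table and column names below are assumptions based on the description above, not verified against the 2.24 schema, and `<jobid>` is a placeholder for your job's id.

```sql
-- Sketch only: table and column names are assumptions based on the
-- description above, not verified against the ManifoldCF 2.24 schema.
-- Replace <jobid> with the id of the job you are investigating.
SELECT linktype, distance, COUNT(*) AS doc_count
FROM hopcount
WHERE jobid = '<jobid>'
GROUP BY linktype, distance
ORDER BY linktype, distance;
```

If documents recorded at the old limit never show up at a larger distance after the limit is raised, that is consistent with the behavior described above.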


On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> No, I haven't used that option; I have it configured as "Keep unreachable
> documents, for now". But is it also ignoring them because they were already
> kept? With this option, when are the documents that are unreachable "for
> now" converted to "forever"?
>
> The only solution I can think of is creating a new job with the exact same
> characteristics and running it.
>
> Regards and thanks
>Marisol
>
>
>
> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>
>> If you ever set "Ignore unreachable documents forever" for the job, you
>> can't go back and stop ignoring them.  The data that the job would need to
>> have recorded for this is gone.  The only way to get it back is if you can
>> convince ManifoldCF to recrawl all the documents in the job.
>>
>>
>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>>
>>> Hi, I have a problem with documents out of scope.
>>>
>>> I changed the maximum hop count for type "redirect" in one of my jobs to
>>> 5, and saw that the job was not processing some pages because of that, so
>>> I removed the value to get them injected into the output connector (the
>>> Solr connector).
>>> After that, the same pages are still out of scope, as if the limit were
>>> set to 1, and they are not indexed.
>>>
>>> I have tried "Reset seeding", thinking that maybe the pages needed to be
>>> checked again, but I still have the same problem. I don't think the
>>> problem is with the output, but I have also used the options "Re-index
>>> all associated documents" and "Remove all associated records", with the
>>> same result.
>>> I don't want to clear the history in the repository (it's a website
>>> connector), as I don't want to lose all the history.
>>>
>>> Is this a bug in ManifoldCF? Is there any option to fix this issue?
>>>
>>> I'm using Manifold version 2.24.
>>>
>>> Thanks
>>> Marisol
>>>
>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
If you ever set "Ignore unreachable documents forever" for the job, you
can't go back and stop ignoring them.  The data that the job would need to
have recorded for this is gone.  The only way to get it back is if you can
convince ManifoldCF to recrawl all the documents in the job.


On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

>
> Hi, I have a problem with documents out of scope.
>
> I changed the maximum hop count for type "redirect" in one of my jobs to 5,
> and saw that the job was not processing some pages because of that, so I
> removed the value to get them injected into the output connector (the Solr
> connector).
> After that, the same pages are still out of scope, as if the limit were set
> to 1, and they are not indexed.
>
> I have tried "Reset seeding", thinking that maybe the pages needed to be
> checked again, but I still have the same problem. I don't think the problem
> is with the output, but I have also used the options "Re-index all
> associated documents" and "Remove all associated records", with the same
> result.
> I don't want to clear the history in the repository (it's a website
> connector), as I don't want to lose all the history.
>
> Is this a bug in ManifoldCF? Is there any option to fix this issue?
>
> I'm using Manifold version 2.24.
>
> Thanks
> Marisol
>
>


Re: web crawler https

2023-09-25 Thread Karl Wright
See this article:

https://stackoverflow.com/questions/6784463/error-trustanchors-parameter-must-be-non-empty

ManifoldCF web crawler configuration allows you to drop certs into a local
trust store for the connection.  You need to either do that (adding
whatever certificate authority cert you think might be missing) or check
the "trust https" checkbox.

You can generally debug what certs a site might need by trying to fetch a
page with curl and using verbose debug mode.

Karl


On Mon, Sep 25, 2023 at 10:48 AM Bisonti Mario 
wrote:

> Hi,
>
> I would like to try indexing a Wordpress internal site.
>
> I tried to configure Repository Web, Job with seeds but I always obtain:
>
>
>
> WARN 2023-09-25T16:31:50,905 (Worker thread '4') - Service interruption
> reported for job 1695649924581 connection 'Wp': IO exception
> (javax.net.ssl.SSLException)reading header: Unexpected error:
> java.security.InvalidAlgorithmParameterException: the trustAnchors
> parameter must be non-empty
>
>
>
> How could I solve this?
>
> Thanks a lot
>
> Mario
>
>


Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
But if those are set, and the connection health check passes, then I can't
tell you why Solr is unhappy with your connection.  It's clearly working
sometimes.  I'd look on the Solr end to figure out whether its rejection is
coming from just one of your instances.



On Wed, Jun 7, 2023 at 7:49 AM Karl Wright  wrote:

> The Solr output connection configuration contains all credentials that are
> sent to Solr.  If those aren't set Solr won't get them.
>
> Karl
>
>
> On Wed, Jun 7, 2023 at 7:23 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> Hi,
>>
>> We are using Solr 8 with basic authentication, and when checking the
>> output connection I'm getting an Exception "Solr authorization failure,
>> code 401: aborting job"
>>
>> The Solr type is SolrCloud, as we have 3 servers (installed in AWS
>> Kubernetes containers). I have set the user ID and password in the Server
>> tab and can connect to ZooKeeper and Solr, since, if I uncheck the option
>> "Block anonymous request", the connector works.
>>
>> How can I make the connection work? I can't uncheck "Block
>> anonymous request".
>> Is there any other place where I have to set the user and password?
>>
>> Thanks
>> Marisol
>>
>>


Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
The Solr output connection configuration contains all credentials that are
sent to Solr.  If those aren't set Solr won't get them.

Karl


On Wed, Jun 7, 2023 at 7:23 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> Hi,
>
> We are using Solr 8 with basic authentication, and when checking the
> output connection I'm getting an Exception "Solr authorization failure,
> code 401: aborting job"
>
> The Solr type is SolrCloud, as we have 3 servers (installed in AWS
> Kubernetes containers). I have set the user ID and password in the Server
> tab and can connect to ZooKeeper and Solr, since, if I uncheck the option
> "Block anonymous request", the connector works.
>
> How can I make the connection work? I can't uncheck "Block
> anonymous request".
> Am I missing any other configuration?
> Is there any other place where I have to set the user and password?
>
> Thanks
> Marisol
>
>


Re: Long Job on Windows Share

2023-05-25 Thread Karl Wright
The jcifs connector does not include a lot of information in the version
string for a file - basically, the length, and the modified date.  So I
would not expect there to be a lot of actual work involved if there are no
changes to a document.

The activity "access" does imply that the system believes that the document
does need to be reindexed.  It clearly reads the document properly.  I
would check to be sure it actually indexes the document.  I suspect that
your job may be reading the file but determining it is not suitable for
indexing and then repeating that every day.  You can see this by looking
for the document in the activity log to see what ManifoldCF decided to do
with it.

Karl
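The activity-log check suggested above can be done from the reports in the UI, or directly in the database; a hedged sketch (the `repohistory` table and its column names are assumptions, not verified against the schema):

```sql
-- Sketch only: table and column names are assumptions, not verified
-- against the ManifoldCF schema.  Shows which activities were recorded
-- for one document identifier.
SELECT activitytype, resultcode, COUNT(*) AS occurrences
FROM repohistory
WHERE entityid = '<document identifier>'
GROUP BY activitytype, resultcode;
```

If the document shows a daily "access" but no corresponding indexing activity, that matches the "read but not indexed" scenario described above.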



On Thu, May 25, 2023 at 6:03 AM Bisonti Mario 
wrote:

> Hi,
>
> I would like to understand how recrawl works
>
>
>
> My job, using connection type "Windows shares", runs for nearly 18
> hours.
>
> My document count is a little over 1 million.
>
> If I check the documents scanned by ManifoldCF I see, for example:
>
> It seems that it reworks the documents every day even if they haven't
> been modified.
>
> So, is this right, or did I choose the wrong job type to crawl the
> documents?
>
>
>
> Thanks a lot
>
> Mario
>
>
>
>
>


Re: Apache Manifold Documentum connector

2023-03-17 Thread Karl Wright
It was open-sourced back in 2012 at the same time ManifoldCF was
open-sourced.  It was written by a contractor paid by MetaCarta, who also
paid for the development of ManifoldCF itself (I developed that).  It was
spun off as open source when MetaCarta was bought by Nokia who had no
interest in the framework or the connectors.

I do not, off the top of my head, remember the contractor's name nor have
his contact information any longer.

There are many users of the Documentum Connector, however, and I would hope
one of them with more DQL experience will respond.

Karl



On Fri, Mar 17, 2023 at 5:41 AM Rasťa Šíša  wrote:

> Hi Karl, thanks for your answer! Would you be able to point me towards the
> author/git branch of the documentum connector?
> Best regards, Rasta
>
> čt 16. 3. 2023 v 20:58 odesílatel Karl Wright  napsal:
>
>> Hi,
>>
>> I didn't write the documentum connector initially, so I trust that the
>> engineer who did knew how to construct the proper DQL.  I've not seen any
>> bugs related to it so it does seem to work.
>>
>> Karl
>>
>>
>> On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:
>>
>>> Hello,
>>> I would like to ask how the Documentum ManifoldCF connector selects the
>>> latest version from the Documentum system?
>>>
>>> The first query that gets composed collects a list of i_chronicle_id
>>> values in DCTM.java. I would like to know, though, how ManifoldCF
>>> recognizes the latest version of the document (e.g. Effective status).
>>> From the UI, I am able to select some of the object types, but not
>>> objecttypes (all).
>>>
>>> In dql it is just e.g.
>>> *select i_chronicle_id from   *
>>> instead of *select i_chronicle_id from  (all)
>>> . *
>>>
>>> This "(all)" form returns all of them. With the first type of query, the
>>> internal functioning of Documentum does not select the i_chronicle_id of
>>> documents that have a newly created version, e.g. the document is
>>> created, approved, and effective, but someone has already created a new
>>> draft for it. With "(all)" in the DQL, it brings in all the documents
>>> and their r_object_id, among which we can select the effective version
>>> by status.
>>> Is this a bug in the ManifoldCF Documentum connector, that it does not
>>> allow you to select those "(all)" objects and the documents with new
>>> versions?
>>> Best regards,
>>> Rastislav Sisa
>>>
>>


Re: Apache Manifold Documentum connector

2023-03-16 Thread Karl Wright
Hi,

I didn't write the documentum connector initially, so I trust that the
engineer who did knew how to construct the proper DQL.  I've not seen any
bugs related to it so it does seem to work.

Karl


On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša  wrote:

> Hello,
> I would like to ask how the Documentum ManifoldCF connector selects the
> latest version from the Documentum system?
>
> The first query that gets composed collects a list of i_chronicle_id values
> in DCTM.java. I would like to know, though, how ManifoldCF recognizes the
> latest version of the document (e.g. Effective status).
> From the UI, I am able to select some of the object types, but not
> objecttypes (all).
>
> In dql it is just e.g.
> *select i_chronicle_id from   *
> instead of *select i_chronicle_id from  (all)
> . *
>
> This "(all)" form returns all of them. With the first type of query, the
> internal functioning of Documentum does not select the i_chronicle_id of
> documents that have a newly created version, e.g. the document is created,
> approved, and effective, but someone has already created a new draft for
> it. With "(all)" in the DQL, it brings in all the documents and their
> r_object_id, among which we can select the effective version by status.
> Is this a bug in the ManifoldCF Documentum connector, that it does not
> allow you to select those "(all)" objects and the documents with new
> versions?
> Best regards,
> Rastislav Sisa
>


Re: Job stucked with cleaning up status

2023-02-03 Thread Karl Wright
…the loop making thread terminates normally! In quite a
> short time I always end up with no `DocumentDeleteThread`s at all, and the
> framework transitions to an inconsistent state.
>
> In the end, I brought Solr back online and managed to finish the deletion
> successfully. But I think this case should be handled in some way.
>
> With respect,
> Abeleshev Artem
>
> On Sun, Jan 29, 2023 at 10:36 PM Karl Wright  wrote:
>
>> Hi,
>>
>> 2.22 makes no changes to the way document deletions are processed over
>> probably 10 previous versions of ManifoldCF.
>>
>> What likely is the case is that the connection to the output for the job
>> you are cleaning up is down.  When that happens, the documents are queued
>> but the delete worker threads cannot make any progress.
>>
>> You can see this maybe by looking at the "Simple Reports" for the job in
>> question and see what it is doing and why the deletions are not succeeding.
>>
>> Karl
>>
>>
>> On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
>> artem.abeles...@rondhuit.com> wrote:
>>
>>> Hi, everyone!
>>>
>>> Another problem that I run into sometimes. We are using ManifoldCF 2.22.1
>>> with multiple nodes in our production. The creation of the MCF job pipeline
>>> is handled via the API calls from our service. We create jobs, repositories
>>> and output repositories. The crawler extracts documents and then they are
>>> pushed to the Solr. The pipeline works OK.
>>>
>>> The problem is about deleting the job. Sometimes the job gets stuck
>>> with a `Cleaning up` status (in the DB it has status `e`, which
>>> corresponds to `STATUS_DELETING`). This time I used the MCF Web Admin
>>> to delete the job (pressed the delete button on the job list page).
>>>
>>> I have checked the sources and debugged it a bit. The method
>>> `deleteJobsReadyForDelete()`
>>> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
>>> works OK. It is unable to delete the job because it still finds some
>>> documents in the document queue table. The following SQL is executed
>>> within this method:
>>>
>>> ```sql
>>> select id from jobqueue where jobid = '1658215015582' and (status = 'E'
>>> or status = 'D') limit 1;
>>> ```
>>>
>>> where status `E` stands for `STATUS_ELIGIBLEFORDELETE` and status `D`
>>> stands for `STATUS_BEINGDELETED`. If at least one such document is
>>> found in the queue it will do nothing. At the moment I had a lot of
>>> documents residing in the `jobqueue` table with the indicated statuses
>>> (actually all of them had status `D`).
>>>
>>> I see that the `Documents delete stuffer thread` is running, and it sets
>>> status `STATUS_BEINGDELETED` on the documents via the
>>> `getNextDeletableDocuments()` method
>>> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
>>> int, long)`). But I can't find any logic that actually deletes the
>>> documents. I've searched through the sources, but status
>>> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
>>> Searching in reverse order from `JobQueue`
>>> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also doesn't give me a
>>> result. I would appreciate it if someone could point out where to look,
>>> so I can debug and check what conditions are preventing the documents
>>> from being removed.
>>>
>>> Thank you!
>>>
>>> With respect,
>>> Artem Abeleshev
>>>
>>


Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-02-01 Thread Karl Wright
It looks like you are running with a profiler?  That uses a lot of memory.
Karl


On Wed, Feb 1, 2023 at 8:06 AM Bisonti Mario 
wrote:

> This is my hs_err_pid_.log
>
>
>
> Command Line: -Xms32768m -Xmx32768m
> -Dorg.apache.manifoldcf.configfile=./properties.xml
> -Djava.security.auth.login.con
>
> fig= -Dorg.apache.manifoldcf.processid=A
> org.apache.manifoldcf.agents.AgentRun
>
>
>
> .
>
> .
>
> .
>
> CodeHeap 'non-profiled nmethods': size=120032Kb used=23677Kb
> max_used=23677Kb free=96354Kb
>
> CodeHeap 'profiled nmethods': size=120028Kb used=20405Kb max_used=27584Kb
> free=99622Kb
>
> CodeHeap 'non-nmethods': size=5700Kb used=1278Kb max_used=1417Kb
> free=4421Kb
>
> Memory: 4k page, physical 72057128k(7300332k free), swap 4039676k(4039676k
> free)
>
> .
>
> .
>
>
>
> Perhaps it could be a RAM problem?
>
>
>
> Thanks a lot
>
>
>
>
>
>
>
>
>
> *Da:* Bisonti Mario
> *Inviato:* venerdì 20 gennaio 2023 10:28
> *A:* user@manifoldcf.apache.org
> *Oggetto:* R: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> I see that the agent crashed:
>
> #
>
> # A fatal error has been detected by the Java Runtime Environment:
>
> #
>
> #  Internal Error (g1ConcurrentMark.cpp:1665), pid=2537463, tid=2537470
>
> #  fatal error: Overflow during reference processing, can not continue.
> Please increase MarkStackSizeMax (current value: 16777216) and restart.
>
> #
>
> # JRE version: OpenJDK Runtime Environment (11.0.16+8) (build
> 11.0.16+8-post-Ubuntu-0ubuntu120.04)
>
> # Java VM: OpenJDK 64-Bit Server VM (11.0.16+8-post-Ubuntu-0ubuntu120.04,
> mixed mode, tiered, g1 gc, linux-amd64)
>
> # Core dump will be written. Default location: Core dumps may be processed
> with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E"
> (or dumping to
> /opt/manifoldcf/multiprocess-zk-example-proprietary/core.2537463)
>
> #
>
> # If you would like to submit a bug report, please visit:
>
> #   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
>
> #
>
>
>
> ---  S U M M A R Y 
>
>
>
> Command Line: -Xms32768m -Xmx32768m
> -Dorg.apache.manifoldcf.configfile=./properties.xml
> -Djava.security.auth.login.config= -Dorg.apache.manifoldcf.processid=A
> org.apache.manifoldcf.agents.AgentRun
>
>
>
> Host: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 8 cores, 68G, Ubuntu
> 20.04.4 LTS
>
> Time: Fri Jan 20 09:38:54 2023 CET elapsed time: 54532.106681 seconds (0d
> 15h 8m 52s)
>
>
>
> ---  T H R E A D  ---
>
>
>
> Current thread (0x7f051940a000):  VMThread "VM Thread" [stack:
> 0x7f051c50a000,0x7f051c60a000] [id=2537470]
>
>
>
> Stack: [0x7f051c50a000,0x00007f051c60a000],  sp=0x7f051c608080,
> free space=1016k
>
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
>
> V  [libjvm.so+0xe963a9]
>
> V  [libjvm.so+0x67b504]
>
> V  [libjvm.so+0x7604e6]
>
>
>
>
>
> So, where could I change that parameter?
>
> Is it an Agent configuration?
>
> Thanks a lot
>
> Mario
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* mercoledì 18 gennaio 2023 14:59
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> When you get a hang like this, getting a thread dump of the agents process
> is essential to figure out what the issue is.  You can't assume that a
> transient error would block anything because that's not how ManifoldCF
> works, at all.  Errors push the document in question back onto the queue
> with a retry time.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jan 18, 2023 at 6:15 AM Bisonti Mario 
> wrote:
>
> Hi Karl.
>
> But I noted that the job was hanging: the processed-document count was
> stuck on the same number, with no further document processing from 6 a.m.
> until I restarted the agent.
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* mercoledì 18 gennaio 2023 12:10
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> Hi, "Possibly transient issue" means that the error will be retried
> anyway, according to a schedule.  There should not need to be any
> requirement to shut down the agents process and restart.

Re: Job stucked with cleaning up status

2023-01-29 Thread Karl Wright
Hi,

2.22 makes no changes to the way document deletions are processed over
probably 10 previous versions of ManifoldCF.

What likely is the case is that the connection to the output for the job
you are cleaning up is down.  When that happens, the documents are queued
but the delete worker threads cannot make any progress.

You can see this maybe by looking at the "Simple Reports" for the job in
question and see what it is doing and why the deletions are not succeeding.

Karl


On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
artem.abeles...@rondhuit.com> wrote:

> Hi, everyone!
>
> Another problem that I run into sometimes. We are using ManifoldCF 2.22.1 with
> multiple nodes in our production. The creation of the MCF job pipeline is
> handled via the API calls from our service. We create jobs, repositories
> and output repositories. The crawler extracts documents and then they are
> pushed to the Solr. The pipeline works OK.
>
> The problem is about deleting the job. Sometimes the job gets stuck with
> a `Cleaning up` status (in the DB it has status `e`, which corresponds to
> `STATUS_DELETING`). This time I used the MCF Web Admin to delete the job
> (pressed the delete button on the job list page).
>
> I have checked the sources and debugged it a bit. The method
> `deleteJobsReadyForDelete()`
> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
> works OK. It is unable to delete the job because it still finds some
> documents in the document queue table. The following SQL is executed
> within this method:
>
> ```sql
> select id from jobqueue where jobid = '1658215015582' and (status = 'E' or
> status = 'D') limit 1;
> ```
>
> where status `E` stands for `STATUS_ELIGIBLEFORDELETE` and status `D`
> stands for `STATUS_BEINGDELETED`. If at least one such document is found
> in the queue it will do nothing. At the moment I had a lot of documents
> residing in the `jobqueue` table with the indicated statuses (actually
> all of them had status `D`).
>
> I see that the `Documents delete stuffer thread` is running, and it sets
> status `STATUS_BEINGDELETED` on the documents via the
> `getNextDeletableDocuments()` method
> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
> int, long)`). But I can't find any logic that actually deletes the
> documents. I've searched through the sources, but status
> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
> Searching in reverse order from `JobQueue`
> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also doesn't give me a
> result. I would appreciate it if someone could point out where to look,
> so I can debug and check what conditions are preventing the documents
> from being removed.
>
> Thank you!
>
> With respect,
> Artem Abeleshev
>
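The check described in the message can be reproduced read-only against the database; this sketch groups the same `jobqueue` table by status, using the job id from the quoted query:

```sql
-- Read-only sketch using the jobqueue table and job id quoted above.
SELECT status, COUNT(*) AS doc_count
FROM jobqueue
WHERE jobid = '1658215015582'
GROUP BY status
ORDER BY status;
```

A large count of `D` (`STATUS_BEINGDELETED`) rows that never shrinks matches the symptom described: the delete stuffer thread has marked the documents, but the delete worker threads are not draining them (here, because the Solr output was down).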


Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
When you get a hang like this, getting a thread dump of the agents process
is essential to figure out what the issue is.  You can't assume that a
transient error would block anything because that's not how ManifoldCF
works, at all.  Errors push the document in question back onto the queue
with a retry time.

Karl


On Wed, Jan 18, 2023 at 6:15 AM Bisonti Mario 
wrote:

> Hi Karl.
>
> But I noted that the job was hanging: the processed-document count was
> stuck on the same number, with no further document processing from 6 a.m.
> until I restarted the agent.
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* mercoledì 18 gennaio 2023 12:10
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: JCIFS: Possibly transient exception detected on attempt 1
> while getting share security: All pipe instances are busy
>
>
>
> Hi, "Possibly transient issue" means that the error will be retried
> anyway, according to a schedule.  There should not need to be any
> requirement to shut down the agents process and restart.
>
> Karl
>
>
>
> On Wed, Jan 18, 2023 at 5:08 AM Bisonti Mario 
> wrote:
>
> Hi.
>
> Often, I obtain the error:
>
> WARN 2023-01-18T06:18:19,316 (Worker thread '89') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.openUnshared(SmbFile.java:665)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:169)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2468)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1243)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:?]
>
>
>
> So, I have to stop the agent, restart it, and the crawling continues.
>
>
>
> How could I solve my issue?
> Thanks a lot.
>
> Mario
>
>


Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
Hi, "Possibly transient issue" means that the error will be retried anyway,
according to a schedule.  There should not need to be any requirement to
shut down the agents process and restart.
Karl
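Karl's point about retry times can also be observed in the database; a hedged sketch (`checktime` as the earliest-retry timestamp is an assumed column name, not verified against the schema):

```sql
-- Sketch only: 'checktime' (assumed to be the earliest time, in epoch
-- milliseconds, at which a queued document may be retried) is not
-- verified against the ManifoldCF schema.
SELECT id, status, checktime
FROM jobqueue
WHERE checktime IS NOT NULL
ORDER BY checktime DESC
LIMIT 20;
```

Documents pushed back with a retry time should show future `checktime` values rather than sitting unprocessed forever; a genuinely hung job is better diagnosed with a thread dump of the agents process, as suggested above.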

On Wed, Jan 18, 2023 at 5:08 AM Bisonti Mario 
wrote:

> Hi.
>
> Often, I obtain the error:
>
> WARN 2023-01-18T06:18:19,316 (Worker thread '89') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.openUnshared(SmbFile.java:665)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:169)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2468)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1243)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:?]
>
>
>
> So, I have to stop the agent, restart it, and the crawling continues.
>
>
>
> How could I solve my issue?
> Thanks a lot.
>
> Mario
>


Re: Help for subscribing the user mailing list of MCF

2023-01-10 Thread Karl Wright
Hmm - I haven't heard of difficulties like this before.  The mail manager
is used apache-wide; if it doesn't work the best thing to do would be to
create an infra ticket in JIRA.

Karl


On Tue, Jan 10, 2023 at 3:50 AM Koji Sekiguchi 
wrote:

> Hi Karl, everyone!
>
> I'm writing to the moderator of the MCF mailing list.
>
> I'd like you to help my colleague to subscribe to MCF user mailing list.
> He's tried to subscribe several times by sending the request to
> user-subscr...@manifoldcf.apache.org but he said that it seemed that
> they were just ignored and he couldn't get any responses from the
> system.
> The email address is abeleshev at gmail dot com.
>
> He has some questions and wants to contribute something if possible.
>
> Thanks!
>
> Koji
>


Re: Is Manifold capable of handling these kind of files

2022-12-23 Thread Karl Wright
The internals of ManifoldCF will handle this fine if you are sure to set
the encoding of your database to be UTF-8.  However, I don't know about the
JCIFS library, and whether there might be a restriction on characters in
that code base.  I think you'd have to just try it and see, frankly.

Karl
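A quick way to sanity-check the character-handling side of this (illustrative Java, not ManifoldCF code; the sample path is shortened from the one in the question):

```java
import java.nio.charset.StandardCharsets;

// Illustrative check: a path containing CJK characters survives a UTF-8
// encode/decode round trip in Java. If the database is also created with
// UTF-8 encoding, such paths can be stored losslessly; whether the JCIFS
// library accepts them on the wire still has to be tried, as Karl notes.
public class Utf8PathCheck {
    static boolean roundTrips(String path) {
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(path);
    }

    public static void main(String[] args) {
        String path = "demo/I. Proposal/__MACOSX/\u864e\u5c3e/._62A33.JPG"; // sample path
        System.out.println(roundTrips(path)); // prints "true"
    }
}
```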


On Fri, Dec 23, 2022 at 6:52 AM Priya Arora  wrote:

> Hi
>
> Is Manifold capable of handling (ingesting) this kind of file in the
> Windows Shares connector, where the path has special characters like these:
>
> demo/11208500/11208550/I. Proposal/PHASE II/220808
> Input/__MACOSX/虎尾/._62A33A6377CF08B472CC2AB562BD8B5D.JPG
>
>
> Any reply would be appreciated
>


Re: Frequent error while window shares job

2022-08-22 Thread Karl Wright
You will need to contact the current maintainers of the Jcifs library to
get answers to these questions.

Karl


On Mon, Aug 22, 2022 at 3:27 AM ritika jain 
wrote:

> Hi All,
>
> I have a Windows Shares job to crawl files from a Samba server; it's a huge
> job that crawls documents in the millions (about 10). While running the
> job, we encounter two types of errors very frequently.
>
> 1)  WARN 2022-08-19T17:17:05,175 (Worker thread '7') - JCIFS: Possibly
> transient exception detected on attempt 3 while getting share security:
> Disconnecting during tree connect
> jcifs.smb.SmbException: Disconnecting during tree connect -- in what case
> can this occur?
>
> 2) WARN 2019-08-25T15:02:27,416 (Worker thread '11') - Service
> interruption reported for job 1565115290083 connection 'fs_vwoaahvp319':
> Timeout or other service interruption: The process cannot access the file
> because it is being used by another process.
>
> What can be the reason for this? Can anybody please help with how we can
> make the job skip the error (if it is for any particular file) and then
> let the job run without aborting?
>
> Thanks
> Ritika
>


Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-14 Thread Karl Wright
Remember, there is already a "forget" button on the output connection,
which will remove everything associated with the connection.  It's meant to
be used when the output index has been reset and is empty.  I'm not sure
what you'd do different functionally.

Karl


On Tue, Jun 14, 2022 at 2:04 AM Koji Sekiguchi 
wrote:

> +1.
>
> I respect the design concept of ManifoldCF, but I think force delete
> options make MCF more
> useful for those who use MCF as crawler. Adding force delete options
> doesn't change default
> behaviors and it doesn't break back-compatibility.
>
> Koji
>
> On 2022/06/14 14:46, Ricardo Ruiz wrote:
> > Hi Karl
> > We are using  ManifoldCF as a crawler more than a synchronizer. We are
> thinking of contributing to
> > ManifoldCf by including a force job delete and force output connector
> delete, considering of course
> > the things that need to be deleted with them (BD, etc). Do you think
> this is possible?
> > We think that not only us but the community would benefit from this
> kind of functionality.
> >
> > Ricardo.
> >
> > On Mon, Jun 13, 2022 at 7:34 PM Karl Wright  daddy...@gmail.com>> wrote:
> >
> > Because ManifoldCF is not just a crawler, but a synchronizer, a job
> represents and includes a
> > list of documents that have been indexed.  Deleting the job requires
> deleting the documents that
> > have been indexed also.  It's part of the basic model.
> >
> > So if you tear down your target output instance and then try to tear
> down the job, it won't
> > work.  ManifoldCF won't just throw away the memory of those
> documents and act as if nothing
> > happened.
> >
> > If you're just using ManifoldCF as a crawler, therefore, your fix is
> about as good as it gets.
> >
> > You can get into similar trouble if, for example, you reinstall
> ManifoldCF but forget to include
> > a connector class that was there before.  Carnage ensues.
> >
> > Karl
> >
> >
> > On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz  > <mailto:ricrui3s...@gmail.com>> wrote:
> >
> > Hi all
> > My team uses mcf to crawl documents and index into solr
> instances, but for reasons beyond
> > our control, sometimes the instances or collections are deleted.
> > When we try to delete a job and the solr instance or collection
> doesn't exist anymore, the
> > job reaches the "End notification" status and gets stuck there.
> No other job can be aborted
> > or deleted until the initial error is fixed.
> >
> > We are able to clean up the errors following the next steps:
> >
> > 1.  Reconfigure the output connector to an existing Solr
> instance and collection
> > 2.  Reset the output connection, so it forgets any indexed
> documents.
> > 3.  Reset the job, so it forgets any indexed documents.
> > 4.  Restart the ManifoldCF server.
> >
> > Is there any other way we can solve this error? Is there any way
> we can force delete the job
> > if we don't care about the job's documents anymore?
> >
> > Thanks in advance.
> > Ricardo.
> >
>


Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-13 Thread Karl Wright
Because ManifoldCF is not just a crawler, but a synchronizer, a job
represents and includes a list of documents that have been indexed.
Deleting the job requires deleting the documents that have been indexed
also.  It's part of the basic model.

So if you tear down your target output instance and then try to tear down
the job, it won't work.  ManifoldCF won't just throw away the memory of
those documents and act as if nothing happened.

If you're just using ManifoldCF as a crawler, therefore, your fix is about
as good as it gets.

You can get into similar trouble if, for example, you reinstall ManifoldCF
but forget to include a connector class that was there before.  Carnage
ensues.

Karl


On Mon, Jun 13, 2022 at 1:39 AM Ricardo Ruiz  wrote:

> Hi all
> My team uses mcf to crawl documents and index into solr instances, but for
> reasons beyond our control, sometimes the instances or collections are
> deleted.
> When we try to delete a job and the solr instance or collection doesn't
> exist anymore, the job reaches the "End notification" status and gets stuck
> there. No other job can be aborted or deleted until the initial error is
> fixed.
>
> We are able to clean up the errors following the next steps:
>
> 1.  Reconfigure the output connector to an existing Solr instance and
> collection
> 2.  Reset the output connection, so it forgets any indexed documents.
> 3.  Reset the job, so it forgets any indexed documents.
> 4.  Restart the ManifoldCF server.
>
> Is there any other way we can solve this error? Is there any way we can
> force delete the job if we don't care about the job's documents anymore?
>
> Thanks in advance.
> Ricardo.
>


Re: Job Service Interruption- and stops

2022-04-29 Thread Karl Wright
"Repeated service interruption" means that it happens again and again.

For this particular document, the problem is that the error we are seeing
is: "The process cannot access the file because it is being used by another
process."

ManifoldCF assumes that if it retries enough it should be able to read the
document eventually.  In this case, if it cannot read the document after 6+
hours, it assumes something is wrong and stops the job.  We can make it
continue at this point but the issue is that you shouldn't be seeing such
an error for such a long period of time.  Perhaps you might want to
research why this is taking place.

Karl


On Fri, Apr 29, 2022 at 4:54 AM ritika jain 
wrote:

> Hi All,
>
> With the window shares connector, on the server I am getting this
> exception and due to repeated service interruption *job stops.*
>
> Error: Repeated service interruptions - failure processing document: The
> process cannot access the file because it is being used by another process.
>
> How can we prevent this?
> I read in the code that it retries the document, but due to repeated
> service interruptions the job stops.
>
>
> Thanks
> Ritika
>


Re: Log4j Update Doubt

2022-03-15 Thread Karl Wright
We cannot do back patches of older versions of ManifoldCF.  There is a new
release which shipped in January that addresses log4j issues.  I suggest
updating to that.

Karl


On Tue, Mar 15, 2022 at 8:59 AM ritika jain 
wrote:

> Hi,
>
> How does manifoldcf use log4j files in the bin directory/distribution? If
> the location is "D:\\Manifoldcf\apache-manifoldcf-2.14\lib", that is the
> lib folder only (for physical file presence).
>
> Also, if the log4j dependency issue has been resolved and version 2.15 or
> higher is used, will it be reflected in ManifoldCF 2.14 as well? If not,
> can you let me know in which places the log4j 2.4.1 jar files should be
> replaced with 2.15?
>
> When the log4j 2.15 jar was manually placed in the 'lib' folder and the
> older version (2.4.1) was deleted, I got this error:
> [image: image.png]
>
> What other location is needed to have the latest jar file.
>
> Thanks
> Ritika
>


Re: Manifoldcf freezes and sit idle

2022-01-31 Thread Karl Wright
As I've mentioned before, the best way to diagnose problems like this is to
get a thread dump of the agents process.  There are many potential reasons
it could occur, ranging from stuck locks to resource starvation.  What
locking model are you using?

Karl


On Mon, Jan 31, 2022 at 6:02 AM ritika jain 
wrote:

> Hi,
>
> I am using ManifoldCF 2.14, the web connector, and Elastic as output.
> I have observed that after a certain period of continuous running, the job
> freezes and does not process anything. Simple history shows nothing after a
> certain point, and it's not just one job; it has been observed for 3
> different jobs. I also checked whether a particular document was the cause
> (that seems NOT to be the case).
>
> Only restarting the docker container helps. After restarting, the job
> continues.
>
> What can be the possible reason for this? How can it be prevented in PROD?
>
> Thanks
> Ritika
>


Re: Log4j dependency

2021-12-14 Thread Karl Wright
ManifoldCF framework and connectors use log4j 2.x to dump information to
the ManifoldCF log file.

Please read the following page:

https://logging.apache.org/log4j/2.x/security.html

Specifically, this part:

'Description: Apache Log4j2 <=2.14.1 JNDI features used in configuration,
log messages, and parameters do not protect against attacker controlled
LDAP and other JNDI related endpoints. An attacker who can control log
messages or log message parameters can execute arbitrary code loaded from
LDAP servers when message lookup substitution is enabled. From log4j
2.15.0, this behavior has been disabled by default.'

In other words, unless you are allowing external people access to the
crawler UI or to the API, it's impossible to exploit this in ManifoldCF.

However, in the interest of assuring people, we are updating this core
dependency to the recommended 2.15.0 anyway.  The release is scheduled by
the end of December.

Karl


On Tue, Dec 14, 2021 at 4:41 AM ritika jain 
wrote:

> .Hi All,
>
> How does ManifoldCF use log4j? When I checked the pom.xml of the ES
> connector, it is shown as an *exclusion* of the Maven dependency.
> [image: image.png]
>
> But when checked in the project's downloaded dependencies, it shows as
> being used and downloaded.
>
> [image: image.png]
> How does Manifold use log4j, and how can we change its version?
>
> Thanks
> Ritika
>


Re: Manifoldcf background process

2021-11-18 Thread Karl Wright
The degree of parallelism can be controlled in two ways.
The first way is to set the number of worker threads to something
reasonable.  Usually, this is no more than about 2x the number of
processors you have.
The second way is to control the number of connections in your jcifs
connector to keep it at something reasonable, e.g. 4 (because windows SMB
is really not good at handling more than that anyway).

These two controls are independent of each other.  From your description,
it sounds like the parameter you want to set is not the number of worker
threads but rather the number of connections.  But setting both properly
certainly will help.  The reason a high worker thread count is bad is that
each active thread uses up some amount of memory; if you set too big a
value, you must give ManifoldCF far too much memory, and you won't be able
to compute the requirement in advance either.


Karl
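For concreteness: the worker-thread count Karl describes is configured in ManifoldCF's properties.xml. A minimal sketch follows (verify the property name against your own installation; the value 8 assumes a 4-core host per the roughly 2x-processors guideline above). The jcifs connection cap is separate: it is the "Max connections" setting on the repository connection in the crawler UI.

```xml
<!-- Sketch of a properties.xml fragment; check the property name
     against your installation before relying on it. -->
<property name="org.apache.manifoldcf.crawler.threads" value="8"/>
```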


On Thu, Nov 18, 2021 at 2:49 AM ritika jain 
wrote:

> Hi All,
>
> I would like to understand the background process of Manifoldcf windows
> shares jobs , and how it processes the path mentioned in the jobs
> configuration.
>
> I am creating a dynamic job via the API using PHP, which will pick up
> approximately 70k documents: a job with 70k different paths, each giving
> the folder/subfolder path and the file name in the filespec.
>
> My question is: how does ManifoldCF work in the background to access all
> the different folders at once, given that most files live in different
> folders? Does it fetch all folder permissions while accessing the
> folders/subfolders and files? How does it fetch permissions for one folder
> (say path 1) and simultaneously fetch permissions/access for a different
> folder (say path 2)?
> Does this put ManifoldCF under load? When this job is running, ManifoldCF
> seems to be under heavy load, gets really slow, and the docker container
> has to be restarted every 15-20 minutes.
>
> How can a job be run efficiently?
>
> Thanks
> Ritika
>
>


Re: Manifold Job process isssue

2021-11-15 Thread Karl Wright
; jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:489)
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:465)
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:426)
> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
> at jcifs.smb.SmbFile.exists(SmbFile.java:845)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2220)
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [Worker thread '10'] WARN jcifs.smb.SmbTransportImpl - Disconnecting
> transport while still in use Transport41[homestore.directory.intra/
> 136.231.158.104:445,state=5,signingEnforced=false,usage=5]:
> [SmbSession[credentials=svc_EScrawl,targetHost=homestore.directory.intra,targetDomain=null,uid=0,connectionState=2,usage=3]]
> [Worker thread '10'] WARN jcifs.smb.SmbSessionImpl - Logging off session
> while still in use
> SmbSession[credentials=svc_EScrawl,targetHost=homestore.directory.intra,targetDomain=null,uid=0,connectionState=3,usage=3]:[SmbTree[share=WINHOMES,service=?,tid=-1,inDfs=false,inDomainDfs=false,connectionState=1,usage=1]]
> [Worker thread '10'] WARN jcifs.util.transport.Transport - sendrecv failed
> jcifs.util.transport.RequestTimeoutException: Transport44 timedout waiting
> for response to
> command=SMB2_TREE_CONNECT,status=0,flags=0x0000,mid=4,wordCount=0,byteCount=72
> at
> jcifs.util.transport.Transport.waitForResponses(Transport.java:365)
> at jcifs.util.transport.Transport.sendrecv(Transport.java:232)
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1
>
>
> On Tue, Nov 9, 2021 at 6:19 PM Karl Wright  wrote:
>
>> One hour is quite a lot and will wreak havoc on the document queue.
>> Karl
>>
>>
>> On Tue, Nov 9, 2021 at 7:08 AM ritika jain 
>> wrote:
>>
>>> I have checked, there is only one hour time difference between docker
>>> container and docker host
>>>
>>> On Tue, Nov 9, 2021 at 4:41 PM Karl Wright  wrote:
>>>
>>>> If your docker image's clock is out of sync badly with the real world,
>>>> then System.currentTimeMillis() may give bogus values, and ManifoldCF uses
>>>> that to manage throttling etc.  I don't know if that is the correct
>>>> explanation but it's the only thing I can think of.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Nov 9, 2021 at 4:56 AM ritika jain 
>>>> wrote:
>>>>
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I am using window shares connector , manifoldcf 2.14 and ES as output.
>>>>> I have configured a job to process 60k of documents, Also these documents
>>>>> are new and do not have corresponding values in DB and ES index.
>>>>>
>>>>> So ideally it should process/Index the documents as soon as the job
>>>>> starts.
>>>>> But Manifoldcf does not process anything for many hours of job start
>>>>> up.I have tried restarting the docker container as well. But it didn't 
>>>>> help
>>>>> much. Also logs only correspond to Long running queries.
>>>>>
>>>>> Why does the manifold behave like that?
>>>>>
>>>>> Thanks
>>>>> Ritika
>>>>>
>>>>


Re: Manifold Job process isssue

2021-11-09 Thread Karl Wright
One hour is quite a lot and will wreak havoc on the document queue.
Karl


On Tue, Nov 9, 2021 at 7:08 AM ritika jain  wrote:

> I have checked, there is only one hour time difference between docker
> container and docker host
>
> On Tue, Nov 9, 2021 at 4:41 PM Karl Wright  wrote:
>
>> If your docker image's clock is out of sync badly with the real world,
>> then System.currentTimeMillis() may give bogus values, and ManifoldCF uses
>> that to manage throttling etc.  I don't know if that is the correct
>> explanation but it's the only thing I can think of.
>>
>> Karl
>>
>>
>> On Tue, Nov 9, 2021 at 4:56 AM ritika jain 
>> wrote:
>>
>>>
>>> Hi All,
>>>
>>> I am using window shares connector , manifoldcf 2.14 and ES as output. I
>>> have configured a job to process 60k of documents, Also these documents are
>>> new and do not have corresponding values in DB and ES index.
>>>
>>> So ideally it should process/Index the documents as soon as the job
>>> starts.
>>> But Manifoldcf does not process anything for many hours of job start
>>> up.I have tried restarting the docker container as well. But it didn't help
>>> much. Also logs only correspond to Long running queries.
>>>
>>> Why does the manifold behave like that?
>>>
>>> Thanks
>>> Ritika
>>>
>>


Re: Manifold Job process isssue

2021-11-09 Thread Karl Wright
If your docker image's clock is out of sync badly with the real world, then
System.currentTimeMillis() may give bogus values, and ManifoldCF uses that
to manage throttling etc.  I don't know if that is the correct explanation
but it's the only thing I can think of.

Karl
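Since ManifoldCF's throttling and queue management lean on `System.currentTimeMillis()`, a quick in-container check for clock trouble is to compare the wall clock against the monotonic `System.nanoTime()` over the same interval. This is an illustrative diagnostic sketch, not ManifoldCF code; the class name and threshold are invented.

```java
// Illustrative check: detect wall-clock jumps that would confuse schedulers
// built on System.currentTimeMillis(). System.nanoTime() is monotonic and
// immune to clock adjustments, so a large disagreement between the two over
// the same sleep interval indicates the wall clock moved underneath us.
public class ClockSkewCheck {
    static long wallVsMonotonicDriftMillis(long sleepMillis) throws InterruptedException {
        long wall0 = System.currentTimeMillis();
        long mono0 = System.nanoTime();
        Thread.sleep(sleepMillis);
        long wallElapsed = System.currentTimeMillis() - wall0;
        long monoElapsed = (System.nanoTime() - mono0) / 1_000_000;
        return Math.abs(wallElapsed - monoElapsed);
    }

    public static void main(String[] args) throws InterruptedException {
        // On a healthy clock the two measurements agree to within a few ms.
        System.out.println(wallVsMonotonicDriftMillis(200));
    }
}
```

A drift of more than a few milliseconds during the run would suggest the container's clock is being stepped while the job executes.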


On Tue, Nov 9, 2021 at 4:56 AM ritika jain  wrote:

>
> Hi All,
>
> I am using the Windows Shares connector, ManifoldCF 2.14, and ES as output.
> I have configured a job to process 60k documents; these documents are new
> and do not have corresponding values in the DB or the ES index.
>
> So ideally it should process/index the documents as soon as the job starts.
> But ManifoldCF does not process anything for many hours after job start-up.
> I have tried restarting the docker container as well, but it didn't help
> much. Also, the logs only show long-running queries.
>
> Why does the manifold behave like that?
>
> Thanks
> Ritika
>


Re: Duplicate key error

2021-10-27 Thread Karl Wright
We see errors like this only because MCF is a highly multithreaded
application, and two threads sometimes are able to collide in what they are
doing even though they are transactionally separated.  That is because of
bugs in the database software.  So if you restart the job it should not
encounter the same problem.

If the problem IS repeatable, we will of course look deeper into what is
going on.

Karl


On Wed, Oct 27, 2021 at 9:52 AM Karl Wright  wrote:

> Is it repeatable?  My guess is it is not repeatable.
> Karl
>
> On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
> wrote:
>
>> So, can it be left as it is? Because it is preventing the job from
>> completing, and it stops.
>>
>> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>>
>>> That's a database bug.  All of our underlying databases have some bugs
>>> of this kind.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> While using Manifoldcf 2.14 with Web connector and ES connector. After
>>>> a certain time of continuing the job (jobs ingest some documents in lakhs),
>>>> we got this error on PROD.
>>>>
>>>> Can anybody suggest what could be the problem?
>>>>
>>>> PRODUCTION MANIFOLD ERROR:
>>>>
>>>> Error: ERROR: duplicate key value violates unique constraint
>>>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>


Re: Duplicate key error

2021-10-27 Thread Karl Wright
Is it repeatable?  My guess is it is not repeatable.
Karl

On Wed, Oct 27, 2021 at 4:43 AM ritika jain 
wrote:

> So, can it be left as it is? Because it is preventing the job from
> completing, and it stops.
>
> On Tue, Oct 26, 2021 at 8:40 PM Karl Wright  wrote:
>
>> That's a database bug.  All of our underlying databases have some bugs of
>> this kind.
>>
>> Karl
>>
>>
>> On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> While using Manifoldcf 2.14 with Web connector and ES connector. After a
>>> certain time of continuing the job (jobs ingest some documents in lakhs),
>>> we got this error on PROD.
>>>
>>> Can anybody suggest what could be the problem?
>>>
>>> PRODUCTION MANIFOLD ERROR:
>>>
>>> Error: ERROR: duplicate key value violates unique constraint
>>> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>>>
>>>
>>> Thanks
>>>
>>>
>>>


Re:

2021-10-26 Thread Karl Wright
That's a database bug.  All of our underlying databases have some bugs of
this kind.

Karl


On Tue, Oct 26, 2021 at 9:17 AM ritika jain 
wrote:

> Hi All,
>
> While using ManifoldCF 2.14 with the Web connector and ES connector, after
> a certain time of running the job (jobs ingest documents in the lakhs), we
> got this error on PROD.
>
> Can anybody suggest what could be the problem?
>
> PRODUCTION MANIFOLD ERROR:
>
> Error: ERROR: duplicate key value violates unique constraint
> "ingeststatus_pkey" Detail: Key (id)=(1624***7) already exists.
>
>
> Thanks
>
>
>


Re: Windows Shares job-Limit on defining no of paths

2021-10-25 Thread Karl Wright
The only limit is that the more you add, the slower it gets.

Karl


On Mon, Oct 25, 2021 at 6:06 AM ritika jain 
wrote:

> Hi ,
> Is there any limit on the number of paths we can define in a job using
> Windows Shares as the repository and ES as the output?
>
> Thanks
>


Re: Null Pointer Exception

2021-10-25 Thread Karl Wright
The API should really catch this situation.  Basically, you are calling a
function that requires an input but you are not providing one.  In that
case the API sets the input to "null", and the detailed operation is
called.  The detailed operation is not expecting a null input.

This is the API piece that is not flagging the error properly:

// Parse the input
Configuration input;

if (protocol.equals("json"))
{
  if (argument.length() != 0)
  {
    input = new Configuration();
    input.fromJSON(argument);
  }
  else
    input = null;
}
else
{
  response.sendError(response.SC_BAD_REQUEST,
    "Unknown API protocol: "+protocol);
  return;
}

Since this is POST, it should assume that the input cannot be null, and if
it is, it's a bad request.

Karl
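The guard Karl describes can be sketched as a self-contained toy (hypothetical class and method names, not the real ManifoldCF servlet code): for a POST, a missing body is rejected up front with 400 Bad Request instead of letting a null propagate into the detailed operation.

```java
// Minimal, self-contained sketch of the missing null-input guard; the class
// name and simplified signature are invented for illustration.
public class ApiInputGuard {
    static final int SC_BAD_REQUEST = 400;

    // Returns an HTTP status: 400 when the POST body is absent, 200 otherwise.
    static int handlePost(String protocol, String argument) {
        if (!protocol.equals("json")) {
            return SC_BAD_REQUEST; // unknown API protocol
        }
        if (argument == null || argument.length() == 0) {
            // A POST requires an input document; flag it here instead of
            // passing a null Configuration into the detailed operation.
            return SC_BAD_REQUEST;
        }
        // ... parse argument into a Configuration and dispatch ...
        return 200;
    }

    public static void main(String[] args) {
        System.out.println(handlePost("json", ""));   // prints "400"
        System.out.println(handlePost("json", "{}")); // prints "200"
    }
}
```

With this guard in place, the PHP client would get a clear 400 for an empty request body rather than the 500 with a NullPointerException shown above.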


On Mon, Oct 25, 2021 at 2:44 AM ritika jain 
wrote:

> Hi,
>
> I am getting null pointer exceptions while creating a job
> programmatically via PHP.
> Can anybody suggest the reason for this?
>
> Error 500: Server Error
> HTTP ERROR 500 Problem accessing
> /mcf-api-service/json/jobs. Reason: Server Error
> Caused by: java.lang.NullPointerException at
> org.apache.manifoldcf.agents.system.ManifoldCF.findConfigurationNode(ManifoldCF.java:208)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.apiPostJob(ManifoldCF.java:3539)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.executePostCommand(ManifoldCF.java:3585)
> at
> org.apache.manifoldcf.apiservlet.APIServlet.executePost(APIServlet.java:576)
> at org.apache.manifoldcf.apiservlet.APIServlet.doPost(APIServlet.java:175)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:497) at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
> at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
> at java.lang.Thread.run(Thread.java:748)  Powered by
> Jetty://  
>
>


Re: Error: Repeated service interruptions - failure processing document: Read timed out

2021-09-30 Thread Karl Wright
Hi,

You say this is a "Tika error".  Is this Tika as a stand-alone service?  I
do not recognize any ManifoldCF classes whatsoever in this thread dump.

If this is Tika, I suggest contacting the Tika team.

Karl


On Thu, Sep 30, 2021 at 3:02 AM Bisonti Mario 
wrote:

> Additional info.
>
>
>
> I am using 2.17-dev version
>
>
>
>
>
>
>
> *Da:* Bisonti Mario
> *Inviato:* martedì 28 settembre 2021 17:01
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Error: Repeated service interruptions - failure processing
> document: Read timed out
>
>
>
> Hello
>
>
>
> I have an error on a job that parses a network folder.
>
>
>
> This is the tika error:
> 2021-09-28 16:14:50 INFO  Server:415 - Started @1367ms
>
> 2021-09-28 16:14:50 WARN  ContextHandler:1671 - Empty contextPath
>
> 2021-09-28 16:14:50 INFO  ContextHandler:916 - Started
> o.e.j.s.h.ContextHandler@3dd69f5a{/,null,AVAILABLE}
>
> 2021-09-28 16:14:50 INFO  TikaServerCli:413 - Started Apache Tika server
> at http://sengvivv02.vimar.net:9998/
>
> 2021-09-28 16:15:04 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:26:46 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:23 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:24 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:27:26 INFO  MetadataResource:484 - meta (application/pdf)
>
> 2021-09-28 16:27:26 INFO  TikaResource:484 - tika (application/pdf)
>
> 2021-09-28 16:30:28 WARN  PhaseInterceptorChain:468 - Interceptor for {
> http://resource.server.tika.apache.org/}MetadataResource has thrown
> exception, unwinding now
>
> org.apache.cxf.interceptor.Fault: Could not send Message.
>
> at
> org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:67)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
>
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>
> at org.eclipse.jetty.server.Server.handle(Server.java:516)
>
> at
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
>
> at
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
>
> at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
>
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>
> at
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>
> at
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
>
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
>
> at java.base/java.lang.Thread.run(Thread.java:834)
>
> Caused by: org.eclipse.jetty.io.EofException
>
> at
> org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
>
> at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
>
> at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
>
> at
> org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
>
> at
> org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:826)
>
> at
> 

Re: Tika Parser Issue

2021-09-07 Thread Karl Wright
This is something you should contact the Tika project about.
Karl


On Tue, Sep 7, 2021 at 8:46 AM ritika jain  wrote:

> Hi All,
>
> I am using tika-core 1.21 and tika-parsers 1.21 jar files as tika
> dependencies in Manifoldcf 2.14 version.
> Getting some issues while parsing *PDF* files: some strange characters
> appeared. I also tried changing the Tika jar version to 1.24 and 1.27 (those
> didn't even extract files correctly).
>
> [image: 365.jfif]
> Also checked with the document content, it seems to be fine.
> Can anybody help me on this?
>
> Thanks
> Ritika
>


Re: Query:JCIFS connector

2021-08-23 Thread Karl Wright
I have a work day today, with limited time.
The UI is what it is; it does not have capabilities beyond what is stated
in the UI and in the manual.  It's meant to allow construction of paths
piece by piece, not by full subdirectory at a time.

You can obviously use the API if you want to construct path specifications
some other way.  It sounds like you are doing things programmatically
anyway so I would definitely look into that.

Karl


On Mon, Aug 23, 2021 at 3:52 AM ritika jain 
wrote:

> Can anybody have a clue on this ?
>
> On Fri, Aug 20, 2021 at 12:33 PM ritika jain 
> wrote:
>
>> Hi All,
>>
>> I have a query: is there any way we can mention a subdirectory path in
>> the file spec of the Windows shares connector?
>>
>> My requirement is to mention the top-most hierarchical folder at the top,
>> as shown in the screenshot, and in the file spec to mention the file name
>> together with its subdirectories.
>> *Say for example there is a file *
>> E:\sharing\demo\Index.pdf
>>
>> Requirement is to mention sharing at top and rest path at file spec.
>>
>> [image: image.png]
>>
>> Is there any way to do it? Any help would be appreciated
>>
>> Thanks
>> Ritika
>>
>>


Re: Job Deletion query

2021-08-12 Thread Karl Wright
Yes, when you delete a job, the indexed documents associated with that job
are removed from the index.

ManifoldCF is a synchronizer, not a crawler, so when you remove the
synchronization job, the indexed documents would be left dangling if the
job didn't delete them.

Karl


On Thu, Aug 12, 2021 at 3:46 AM ritika jain 
wrote:

> Hi All,
>
> When we delete a job in ManifoldCF, does it also delete the indexed
> documents via that job from the Elastic index as well?
>
> I understand that when a job is deleted from the ManifoldCF interface, it
> will delete all the documents referenced via that job from Postgres. But
> why is it deleted from the ES index?
>
> Thanks
> Ritika
>


Re: Window shares dynamic Job issue

2021-08-11 Thread Karl Wright
,"_value_":"1599130705168"},{"_type_":"description","_value_":"Demo_job"},{"_type_":"repository_connection","_value_":"mas_Repo"},{"_type_":"document_specification","_children_":[{"_type_":"startpoint","include":[{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pdf","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.doc","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docm","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.docb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dot","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wpd
>  
> ","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.pptx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.ppt","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp4","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp5","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp6","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.wp7","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsm
>  
> ","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xls","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsb","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.xlsx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.png","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.jpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.bmp","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.gif","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpeg","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.mpg","_value_":"","_attribute_type":"file"},{"_attribute_filespec":"*","_value_":"","_attribute_type":"directory"}],"_attribute_path":"*windows\/Job\/Demo
>  School 
> Network\/Information\*","_value_":""},{"_type_":"maxlength","_value_":"","_attribute_value":"500"},{"_type_":"security","_value_":"","_attribute_value":"on"},{"_type_":"sharesecurity","_value_":"","_attribute_value":"on"},{"_type_":"parentfoldersecurity","_value_":"","_attribute_value":"off"}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Tika"},{"_type_":"stage_specification","_children_":[{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"lowerNames","_value_":"","_attribute_value":"false"},{"_type_":"writeLimit","_value_":"","_attribute_value":""},{"_type_":"ignoreException","_value_":"","_attribute_value":"true"},{"_type_":"boilerplateprocessor","_value_":"","_attribute_value":"de.l3s.boilerpipe.extractors.KeepEverythingExtractor"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"1"},{"_type_":"stage_prerequisite","_value_":"0"},{"_type_":"stage_isoutput","_value_":"false"},{"_type_":"stage_connectionname","_value_":"Metadata
>  
> Adjuster"},{"_type_":"stage_specification","_children_":[{"_type_":"expression","_attribute_parameter":"d_connector_type","_value_":"","_attribute_value":"FileShare"},{"_type_":"expression","_attribute_parameter":"d_description","_value_":"","_attribute_value":"\"${dc:description}\"
>  
> "},{"_type_":"keepAllMetadata","_value_":"","_attribute_value":"true"},{"_type_":"filterEmpty","_value_":"","_attribute_value":"true"}]}]},{"_type_":"pipelinestage","_children_":[{"_type_":"stage_id","_value_":"2"},{"_type_":"stage_prerequisite","_value_":"1"},{"_type_":"stage_isoutput","_value_":"true"},{"_type_":"stage_connectionname","_value_":"Deltares_Output"},{"_type_":"stage_specification"}]},{"_type_":"start_mode","_value_":"manual"},{"_type_":"run_mode","_value_":"scan
>  
> once"},{"_type_":"hopcount_mode","_value_":"accurate"},{"_type_":"priority","_value_":"5"},{"_type_":"recrawl_interval","_value_":"8640"},{"_type_":"max_recrawl_interval","_value_":"infinite"},{"_type_":"expiration_interval","_value_":"infinite"},{"_type_":"reseed_interval","_value_":"360"}]}}
>
> Basically these two job structures are exactly the same, except that the
> path is mentioned as 1) the complete path down to the file location, or
> 2) only the path down to the folders.
>
> In the first case the ingested file has a slash at the end, and in the
> second case it doesn't.
>
>
> Thanks'
>
> Ritika
>
>
> On Tue, Aug 10, 2021 at 6:52 PM Karl Wright  wrote:
>
>> I am sorry, but I'm having trouble understanding how exactly you are
>> configuring the JCIFS connector in these two cases.Can you view the job
>> in each case and provide cut-and-paste of the view?
>>
>> Karl
>>
>>
>> On Tue, Aug 10, 2021 at 9:09 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am using the Windows shares connector in ManifoldCF 2.14 and
>>> Elasticsearch as output.
>>> I have created a dynamic ManifoldCF job via the API, through which a job
>>> is created in ManifoldCF with an inclusions list and a path; only a
>>> particular file path is mentioned. Example file path: C:/Users/Dell/Desktop/abc.txt.
>>>
>>> A job will be created to crawl only this single file.
>>> *Issue:*
>>> When this job ingests the document into Elasticsearch, a slash gets
>>> appended at the end.
>>>
>>> *Ingested file is* :- C:/Users/Dell/Desktop/abc.txt/
>>>
>>> But when the same file is crawled via the ManifoldCF job settings, by
>>> mentioning the path only down to the folder level (manual job creation
>>> does not allow a path down to a particular file, only down to folders),
>>> it does not append the slash.
>>>
>>> *Ingested file in this case:-*
>>> C:/Users/Dell/Desktop/abc.txt
>>> as expected original file.
>>>
>>> *Query*
>>> Why is this the case? It makes searching in ES ambiguous.
>>>
>>> Thanks
>>> Ritika
>>>
>>>
>>>


Re: Window shares dynamic Job issue

2021-08-10 Thread Karl Wright
I am sorry, but I'm having trouble understanding how exactly you are
configuring the JCIFS connector in these two cases. Can you view the job
in each case and provide cut-and-paste of the view?

Karl


On Tue, Aug 10, 2021 at 9:09 AM ritika jain 
wrote:

> Hi All,
>
> I am using the Windows shares connector in ManifoldCF 2.14 and
> Elasticsearch as output.
> I have created a dynamic ManifoldCF job via the API, through which a job
> is created in ManifoldCF with an inclusions list and a path; only a
> particular file path is mentioned. Example file path: C:/Users/Dell/Desktop/abc.txt.
>
> A job will be created to crawl only this single file.
> *Issue:*
> When this job ingests the document into Elasticsearch, a slash gets
> appended at the end.
>
> *Ingested file is* :- C:/Users/Dell/Desktop/abc.txt/
>
> But when the same file is crawled via the ManifoldCF job settings, by
> mentioning the path only down to the folder level (manual job creation
> does not allow a path down to a particular file, only down to folders),
> it does not append the slash.
>
> *Ingested file in this case:-*
> C:/Users/Dell/Desktop/abc.txt
> as expected original file.
>
> *Query*
> Why is this the case? It makes searching in ES ambiguous.
>
> Thanks
> Ritika
>
>
>


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket
that describes the functionality you think the connector should have.

Karl


On Wed, Jul 7, 2021 at 9:29 AM h0444xk8  wrote:

> Hi,
>
> yes, that seems to be the reason. In:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>
> there is the following code sequence:
>
> else if (lowercaseLine.startsWith("sitemap:"))
>{
>  // We don't complain about this, but right now we don't
> listen to it either.
>}
>
> But if I have a look at:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>
> a sitemap containing an urlset seems to be handled
>
> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>{
>  // Sitemap detected
>  outerTagCount++;
>  return new
>
> UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
>}
>
> So, my question is: is there another way to handle sitemaps inside the
> Web Crawler?
>
> Cheers Sebastian
>
>
>
>
>
> Am 07.07.2021 12:23 schrieb Karl Wright:
>
> > The robots parsing does not recognize the "sitemaps" line, which was
> > likely not in the spec for robots when this connector was written.
> >
> > Karl
> >
> > On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:
> >
> >> Hi,
> >>
> >> I have a general question. Is the Web connector supporting sitemap
> >> files
> >> referenced by the robots.txt? In my use case the robots.txt is stored
> >> in
> >> the root of the website and is referencing two compressed sitemaps.
> >>
> >> Example of robots.txt
> >> 
> >> User-Agent: *
> >> Disallow:
> >> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [1]
> >> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]
> >>
> >> When start crawling in „Simple History" there is an error log entry as
> >> follows:
> >>
> >> Unknown robots.txt line: 'Sitemap:
> >> https://www.example.de/sitemap/en-sitemap.xml.gz [2]'
> >>
> >> Is there a general problem with sitemaps at all or with sitemaps
> >> referenced in robots.txt or with compressed sitemaps?
> >>
> >> Best regards
> >>
> >> Sebastian
>
>
> Links:
> --
> [1] https://www.example.de/sitemap/de-sitemap.xml.gz
> [2] https://www.example.de/sitemap/en-sitemap.xml.gz
>
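As the Robots.java excerpt above shows, the connector currently ignores
"Sitemap:" lines in robots.txt. A minimal sketch of what recognizing them
could look like is below. This is illustrative only, not ManifoldCF code;
the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect "Sitemap:" lines from a robots.txt body
// instead of silently ignoring them as the current parser does.
public class RobotsSitemapSketch {

    public static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\r?\n")) {
            String trimmed = line.trim();
            // robots.txt directive names are case-insensitive.
            if (trimmed.toLowerCase().startsWith("sitemap:")) {
                String url = trimmed.substring("sitemap:".length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        String robots = "User-Agent: *\nDisallow:\n"
            + "Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz\n"
            + "Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz\n";
        System.out.println(extractSitemaps(robots).size()); // prints 2
    }
}
```

Each returned URL could then be fetched (decompressing `.xml.gz` content
as needed) and fed to the existing urlset/sitemapindex handling quoted
above from WebcrawlerConnector.java.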


Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely
not in the spec for robots when this connector was written.

Karl


On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:

> Hi,
>
> I have a general question. Is the Web connector supporting sitemap files
> referenced by the robots.txt? In my use case the robots.txt is stored in
> the root of the website and is referencing two compressed sitemaps.
>
> Example of robots.txt
> 
> User-Agent: *
> Disallow:
> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
>
> When start crawling in „Simple History" there is an error log entry as
> follows:
>
> Unknown robots.txt line: 'Sitemap:
> https://www.example.de/sitemap/en-sitemap.xml.gz'
>
> Is there a general problem with sitemaps at all or with sitemaps
> referenced in robots.txt or with compressed sitemaps?
>
> Best regards
>
> Sebastian
>


Re: Manifoldcf Redirection process

2021-05-28 Thread Karl Wright
302 does get recognized as a redirection, yes


On Fri, May 28, 2021 at 5:07 AM ritika jain 
wrote:

> Is the process the same when the fetch/process status code returned is
> 302, when running a job with the web crawler and ES output connector?
>
> Can anybody have a clue about this?


Re: Manifoldcf Redirection process

2021-05-19 Thread Karl Wright
ManifoldCF reads all the URLs on its queue.
If it's a 301, it detects this and pushes the new URL onto the document
queue.
When it gets to the new URL, it processes it like any other.

Karl
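The queue behavior Karl describes can be sketched roughly as follows. This
is a toy illustration with invented names (`crawl`, a string-encoded
response map), not ManifoldCF internals: a 301 pushes the redirect target
onto the queue, and only the final (200) URL ends up indexed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of a crawl queue that follows 301 redirects.
// Responses map a URL either to "301:<target>" or to "200".
public class RedirectQueueSketch {

    public static Set<String> crawl(String seed, Map<String, String> responses) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        Set<String> indexed = new HashSet<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String response = responses.getOrDefault(url, "404");
            if (response.startsWith("301:")) {
                String target = response.substring(4);
                // Redirect target goes on the queue and is processed
                // like any other URL; the old URL is not indexed.
                if (seen.add(target)) {
                    queue.add(target);
                }
            } else if (response.equals("200")) {
                indexed.add(url);
            }
            // 404 etc.: nothing is indexed for this URL.
        }
        return indexed;
    }

    public static void main(String[] args) {
        Set<String> out = crawl("http://site/old",
            Map.of("http://site/old", "301:http://site/new",
                   "http://site/new", "200"));
        System.out.println(out); // prints [http://site/new]
    }
}
```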


On Wed, May 19, 2021 at 8:32 AM ritika jain 
wrote:

> Hi
>
> I want to understand how ManifoldCF handles redirection of URLs in the
> case of the web crawler connector.
>
> If there is a page redirect (through a 301) to another URL, then the next
> crawl will detect the redirect, index the new (final) URL, and display it
> in the search results (instead of the old URL that redirects), just as is
> done by search engines like Google/Bing.
>
> Is it true that ManifoldCF is capable of avoiding the URL that returns a
> 301 and instead picking up, and ingesting, the URL to which it redirects?
>
> If not, what process does ManifoldCF follow to ingest redirected URLs?
>
> Thanks
> Ritika
>
>
>


Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
"crashing the manifold" is probably running out of memory, and it is
probably due to having too many worker threads and insufficient memory, not
the error you found.

If that error caused a problem, it would simply abort the job, not "crash"
Manifold.

Karl


On Fri, May 14, 2021 at 4:10 AM ritika jain 
wrote:

> It retries 3 times, and it usually crashes ManifoldCF.
>
> I observed a similar ticket,
> https://issues.apache.org/jira/browse/CONNECTORS-1633. Is ManifoldCF
> itself capable of skipping the file that causes the issue instead of
> aborting the job or crashing ManifoldCF?
>
> On Fri, May 14, 2021 at 1:34 PM Karl Wright  wrote:
>
>> '
>>
>> *JCIFS: Possibly transient exception detected on attempt 1 while getting
>> share security'Yes, it is going to retry.*
>>
>> *Karl*
>>
>> On Fri, May 14, 2021 at 1:45 AM ritika jain 
>> wrote:
>>
>>> Hi,
>>> I am using Windows shares connector in manifoldcf 2.14 and ElasticSearch
>>> connector as Output connector and Tika and Metadata adjuster as
>>> Transformation connector
>>>
>>> Trying to crawl the files from an SMB server with 64 GB of memory, where
>>> the ManifoldCF start-options file is given 32 GB of memory.
>>> But many times I got different errors while processing documents:
>>> *1) Access is denied*
>>> *2) ... 23 more*
>>>
>>>
>>> * WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service
>>> interruption reported for job 1599130705168 connection 'Themas_Repo':
>>> Timeout or other service interruption: Interrupted while acquiring
>>> credits WARN 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly
>>> transient exception detected on attempt 1 while getting share security:
>>> Interrupted while acquiring creditsjcifs.smb.SmbException: Interrupted
>>> while acquiring credits*
>>> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
>>> ~[jcifs-ng-2.1.2.jar:?]
>>> at
>>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
>>> [mcf-jcifs-connector.jar:2.14]
>>> at
>>> org.apache.

Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
'

*JCIFS: Possibly transient exception detected on attempt 1 while getting
share security'Yes, it is going to retry.*

*Karl*

On Fri, May 14, 2021 at 1:45 AM ritika jain 
wrote:

> Hi,
> I am using Windows shares connector in manifoldcf 2.14 and ElasticSearch
> connector as Output connector and Tika and Metadata adjuster as
> Transformation connector
>
> Trying to crawl the files from an SMB server with 64 GB of memory, where
> the ManifoldCF start-options file is given 32 GB of memory.
> But many times I got different errors while processing documents:
> *1) Access is denied*
> *2) ... 23 more*
>
>
> * WARN 2021-05-13T12:33:16,318 (Worker thread '6') - Service interruption
> reported for job 1599130705168 connection 'Themas_Repo': Timeout or other
> service interruption: Interrupted while acquiring credits WARN
> 2021-05-13T12:33:17,315 (Worker thread '6') - JCIFS: Possibly transient
> exception detected on attempt 1 while getting share security: Interrupted
> while acquiring creditsjcifs.smb.SmbException: Interrupted while acquiring
> credits*
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1530)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbSessionImpl.sessionSetupSMB2(SmbSessionImpl.java:549)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.sessionSetup(SmbSessionImpl.java:483)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:369)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:347)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeImpl.treeConnect(SmbTreeImpl.java:607)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectTree(SmbTreeConnection.java:609)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:563)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectHost(SmbTreeConnection.java:484)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbTreeConnection.connect(SmbTreeConnection.java:460)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbTreeConnection.connectWrapException(SmbTreeConnection.java:421)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.ensureTreeConnected(SmbFile.java:551)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureTreeConnected(SmbPipeHandleImpl.java:111)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.ensureOpen(SmbPipeHandleImpl.java:166)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.smb.SmbPipeHandleImpl.sendrecv(SmbPipeHandleImpl.java:250)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendReceiveFragment(DcerpcPipeHandle.java:113)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:243)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:216)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:234)
> ~[jcifs-ng-2.1.2.jar:?]
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2337)
> ~[jcifs-ng-2.1.2.jar:?]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2500)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1261)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:647)
> [mcf-jcifs-connector.jar:2.14]
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
> Caused by: java.io.InterruptedIOException: Interrupted while acquiring
> credits
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:976)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
> Caused by: java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> ~[?:1.8.0_292]
> at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
> ~[?:1.8.0_292]
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:959)
> ~[?:?]
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[?:?]
> ... 23 more
>  WARN 2021-05-13T12:33:17,314 (Worker thread '2') - JCIFS: Possibly
> transient exception detected on attempt 2 while getting share security:
> Interrupted while acquiring credits
> jcifs.smb.SmbException: Interrupted while acquiring credits
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.
>
> Do we have such functionality that, in case an error like this occurs,
> it should skip the particular record and then continue to process
> further instead of 

Re: Notification connector error

2021-05-11 Thread Karl Wright
This used to work fine, but I suspect that when SSL was declared unsafe, it
was disabled, and now only TLS will work.

Karl
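One thing worth trying (an assumption on my part, not a verified fix):
newer JVMs disable old SSL/TLS protocol versions by default, and JavaMail's
`mail.smtp.ssl.protocols` session property controls which protocols are
offered during the STARTTLS handshake. The connector's configuration
properties might then look like:

```
mail.smtp.ssl.trust : smtp.gmail.com
mail.smtp.starttls.enable : true
mail.smtp.auth : true
mail.smtp.ssl.protocols : TLSv1.2
```

Whether this resolves the "No appropriate protocol" handshake failure
depends on the JVM's `jdk.tls.disabledAlgorithms` settings and on what the
SMTP server accepts.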


On Tue, May 11, 2021 at 12:13 PM  wrote:

> Hello,
>
>
>
> I am trying to use an email notification connector but without success.
> When the connector tries to send an email I keep having the following error:
>
>
>
> Email: Error sending email: Could not convert socket to TLS
>
> javax.mail.MessagingException: Could not convert socket to TLS
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1918)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:652)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:317)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:176)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Service.connect(Service.java:125)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send0(Transport.java:194)
> ~[mail-1.4.5.jar:1.4.5]
>
> at javax.mail.Transport.send(Transport.java:124)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailSession.send(EmailSession.java:112)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.notifications.email.EmailConnector$SendThread.run(EmailConnector.java:963)
> ~[?:?]
>
> Caused by: javax.net.ssl.SSLHandshakeException: No appropriate protocol
> (protocol is disabled or cipher suites are inappropriate)
>
> at
> sun.security.ssl.HandshakeContext.(HandshakeContext.java:170) ~[?:?]
>
> at
> sun.security.ssl.ClientHandshakeContext.(ClientHandshakeContext.java:98)
> ~[?:?]
>
> at
> sun.security.ssl.TransportContext.kickstart(TransportContext.java:221)
> ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:433) ~[?:?]
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411) ~[?:?]
>
> at
> com.sun.mail.util.SocketFetcher.configureSSLSocket(SocketFetcher.java:548)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.util.SocketFetcher.startTLS(SocketFetcher.java:485)
> ~[mail-1.4.5.jar:1.4.5]
>
> at
> com.sun.mail.smtp.SMTPTransport.startTLS(SMTPTransport.java:1913)
> ~[mail-1.4.5.jar:1.4.5]
>
> ... 8 more
>
>
>
>
>
> The connector is configured with a gmail SMTP, using the configuration
> recommended by the documentation:
>
>
>
> Hostname: smtp.gmail.com
>
> Port: 587
>
>
>
> Configuration properties:
>
> mail.smtp.ssl.trust : smtp.gmail.com
>
> mail.smtp.starttls.enable : true
>
> mail.smtp.auth : true
>
>
>
>
>
> The username and password I use are correct and I also tried with the
> office365 SMTP and I get the same error.
>
>
>
> I am using openjdk version "11.0.11" 2021-04-20. Do you have any idea
> about my issue?
>
>
>
> Julien
>
>
>


Re: General questions

2021-04-12 Thread Karl Wright
Hi,

There was a book written but never published on ManifoldCF and how to write
connectors.  It's meant to be extended in that way.  The PDFs for the book
are available for free online, and they are linked through the manifoldcf
web site.

Karl


On Mon, Apr 12, 2021 at 8:49 AM koch  wrote:

> Hi everyone,
>
> I would like to know, what is planned for manifoldCF in the future?
> How much activity is in the project, or is there already an 'end of
> life' in sight?
>
> Is it compatible with Java 11 or higher?
>
> Has someone tried to use it in an OSGi container like Karaf?
>
> How can I extend ManifoldCF? If I would like to write my own repository
> or output connectors, do I have to plug them in at build time, or is it
> possible to add connectors at runtime?
>
> Any help would be much appreciated.
>
> Kind regards,
> Matthias
>
>
>


Re: Manifoldcf Deletion Process

2021-03-30 Thread Karl Wright
Hi Ritika,

There is no deletion process.  Deletion takes place when a job is run in a
mode where deletion is possible (there are some where it is not).  The way
it takes place depends on the kind of repository connector (what model it
declares itself to use).

For the most common kinds of connectors, the job sequence involves scanning
all documents described by the job.  If the document is gone, it is deleted
right away.  If the document just wasn't accessed on the crawl, then at
the end those no-longer-referenced documents are removed.

Karl
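The two-phase behavior Karl describes can be sketched as follows. This is a
hedged illustration with invented names (`sync`, a boolean "still present"
map standing in for the repository), not actual ManifoldCF code: documents
found missing during the scan are deleted immediately, and documents that
were simply never reached on this crawl are cleaned up at the end of the
job.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of synchronizer-style deletion: scan, delete-missing,
// then end-of-job cleanup of no-longer-referenced documents.
public class DeletionModelSketch {

    public static Set<String> sync(Set<String> previouslyIndexed,
                                   Map<String, Boolean> repository) {
        Set<String> index = new HashSet<>(previouslyIndexed);
        Set<String> reached = new HashSet<>();
        // Phase 1: scan every document the job currently describes.
        for (Map.Entry<String, Boolean> doc : repository.entrySet()) {
            reached.add(doc.getKey());
            if (Boolean.TRUE.equals(doc.getValue())) {
                index.add(doc.getKey());      // present: (re)index it
            } else {
                index.remove(doc.getKey());   // gone (e.g. 404): delete right away
            }
        }
        // Phase 2: end of job -- drop anything no longer referenced.
        index.retainAll(reached);
        return index;
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("a", "b", "c");
        // "a" still exists, "b" is now missing, "c" left the job's scope.
        Set<String> after = sync(before, Map.of("a", true, "b", false));
        System.out.println(after); // prints [a]
    }
}
```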


On Tue, Mar 30, 2021 at 9:03 AM ritika jain 
wrote:

> Hi All,
>
> I want to understand the ManifoldCF deletion process, i.e., in which
> cases the deletion process (when checked in Simple History) executes.
> One case, as per my knowledge, is whenever the seed URL of a particular
> job is changed.
> What are all the cases in which the deletion process runs?
>
> My requirement is to research whether ManifoldCF is capable of handling
> the scenario where a URL exists and has been ingested into the Elastic
> index (say www.abc.com),
>
> Next time, when the job is run, say the URL www.abc.com does not exist
> anymore and results in a 404. Is ManifoldCF capable (by default) of
> handling this 404 URL and deleting it from the database and from the
> ElasticSearch index (in which it was ingested already)?
>
> Any help will be thankful.
> Thanks
> Ritika
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
I have now updated (I think) everything that this patch actually has, save
for one deprecated field substitution (the "types" field is now the "doc_"
field).  I would like to know more about this.  Does the "types" field no
longer work?  Should we send both, in order to be sure that the connector
works with most versions of ElasticSearch?  Please help clarify so that I
can finish this off.

The changes are committed to trunk; I would be very appreciative if Shirai
Takashi/白井隆 reviewed them there.  Thanks!
Karl


On Sat, Mar 20, 2021 at 4:32 AM Karl Wright  wrote:

> Hi,
>
> Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .
>
> I did not commit the patches as given because I felt that the fix was a
> relatively narrow one and it could be implemented with no user
> involvement.  Adding control for the user was therefore beyond the scope of
> the repair.
>
> There are more changes in these patches than just the ID length issue.  I
> am working to add this functionality as well but without anything I would
> consider to be unneeded.
> Karl
>
>
> On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:
>
>> Thanks for the information.  I'll see what I can do.
>> Karl
>>
>>
>> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 <
>> shi...@nintendo.co.jp> wrote:
>>
>>> Hi, Karl.
>>>
>>> Karl Wright wrote:
>>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>>> is
>>> >the only way I believe we're allowed to accept it legally.
>>>
>>> Are you asking me to attach the patch to the JIRA ticket?
>>> I can't access the JIRA because of our firewall.
>>> Sorry.
>>> What can I do without JIRA?
>>>
>>> 
>>> Nintendo, Co., Ltd.
>>> Product Technology Dept.
>>> Takashi SHIRAI
>>> PHONE: +81-75-662-9600
>>> mailto:shi...@nintendo.co.jp
>>>
>>


Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
Hi,

Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .

I did not commit the patches as given because I felt that the fix was a
relatively narrow one and it could be implemented with no user
involvement.  Adding control for the user was therefore beyond the scope of
the repair.

There are more changes in these patches than just the ID length issue.  I
am working to add this functionality as well but without anything I would
consider to be unneeded.
Karl


On Fri, Mar 19, 2021 at 3:48 AM Karl Wright  wrote:

> Thanks for the information.  I'll see what I can do.
> Karl
>
>
> On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
> wrote:
>
>> Hi, Karl.
>>
>> Karl Wright wrote:
>> >Hi - I'm still waiting for this patch to be attached to a ticket.  That
>> is
>> >the only way I believe we're allowed to accept it legally.
>>
>> Are you asking me to attach the patch to the JIRA ticket?
>> I can't access the JIRA because of our firewall.
>> Sorry.
>> What can I do without JIRA?
>>
>> 
>> Nintendo, Co., Ltd.
>> Product Technology Dept.
>> Takashi SHIRAI
>> PHONE: +81-75-662-9600
>> mailto:shi...@nintendo.co.jp
>>
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-19 Thread Karl Wright
Thanks for the information.  I'll see what I can do.
Karl


On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >Hi - I'm still waiting for this patch to be attached to a ticket.  That is
> >the only way I believe we're allowed to accept it legally.
>
> Are you asking me to attach the patch to the JIRA ticket?
> I can't access the JIRA because of our firewall.
> Sorry.
> What can I do without JIRA?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Karl Wright
Hi - I'm still waiting for this patch to be attached to a ticket.  That is
the only way I believe we're allowed to accept it legally.

Karl


On Thu, Mar 4, 2021 at 7:16 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Karl.
>
> Karl Wrightさんは書きました:
> >I agree it is unlikely that the JDK will lose support for SHA-1 because it
> >is used commonly, as is MD5.  So please feel free to use it.
>
> I know.
> I think that SHA-1 is better on the whole.
> I don't care that apache-manifoldcf-elastic-id-2.patch.gz is discarded.
>
> SHA-256 is surely safer from the risk of collision,
> but the collision risk with SHA-1 can be ignored unless it is intentional.
> It needs to be considered only when ManifoldCF is used on worldwide-scale
> data.
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Karl Wright
I agree it is unlikely that the JDK will lose support for SHA-1 because it
is used commonly, as is MD5.  So please feel free to use it.

Karl
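The availability concern above can be settled defensively at runtime: the JDK lets you probe whether a digest algorithm is present before relying on it.  A minimal sketch (class and method names are invented for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Defensive check for the concern above: verify a digest algorithm is
// actually provided by the running JDK before relying on it.
public class DigestAvailable {
    static boolean available(String algorithm) {
        try {
            MessageDigest.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // SHA-1 and SHA-256 are both mandatory-to-implement in the JDK,
        // so both should report as available on current releases.
        System.out.println(available("SHA-1"));
        System.out.println(available("SHA-256"));
    }
}
```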


On Wed, Mar 3, 2021 at 7:54 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, Horn.
>
> Jörn Franke wrote:
> >Makes sense
>
> I don't think that it's easy.
>
>
> >>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when
> it will be removed from JDK.
>
> I also know SHA-1 is dangerous.
> Someone can generate a string that hashes to the same SHA-1 value in order
> to impersonate another.
> So SHA-1 should not be used for certificates,
> and a future JDK may stop allowing SHA-1 in certificates.
> But the JDK will never stop supporting the SHA-1 algorithm itself.
>
> If SHA-1 were removed from the JDK,
> ManifoldCF could not be built, because SHA-1 is also used elsewhere.
> Some connectors already use SHA-1 as the ID value,
> so previously saved records would become inaccessible.
> I can use SHA-256 with the Elasticsearch connector,
> but how should the other uses of SHA-1 be managed?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
>


Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Karl Wright
Hi - this is very helpful.  I would like you to officially create a ticket
in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach
these patches.  Backwards compatibility means that we very likely have to
use the hash approach, and not use the decoding approach.

Thanks,
Karl


On Mon, Mar 1, 2021 at 10:10 PM Shirai Takashi/ 白井隆 
wrote:

> Hi, there.
>
> I've found another problem in the Elasticsearch connector.
> The Elasticsearch output connector uses the URI string as the ID.
> Elasticsearch allows IDs of no more than 512 bytes.
> If the URL is too long, this causes an HTTP 400 error.
>
> I have prepared two solutions in the attached patch.
> The first is URI decoding.
> If the URI includes multibyte characters,
> the ID gets URL-encoded twice.
> Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
> This enlarges the ID length unnecessarily.
> So I added an option to decode the URI before it is encoded as the ID.
>
> But the length may still be longer than 512 bytes.
> The other solution is hashing.
> The newly added options are the following:
> Raw) uses the URI string as is.
> Hash) always hashes (SHA-1) the URI string.
> Hash if long) hashes the URI only if its length exceeds 512 bytes.
> The last one is provided for compatibility.
>
> Both solutions cause a new problem:
> if the URI is decoded or hashed,
> the original URI cannot be kept in each document.
> So I added new fields:
> URI field name) keeps the original URI string as is.
> Decoded URI field name) keeps the decoded URI string.
> With the default settings these fields are left empty.
>
>
> I sent the patch for Ingest-Attachment the other day.
> So this mail attaches two patches.
> apache-manifoldcf-2.18-elastic-id.patch.gz:
>  The patch for 2.18, including the patch from the other day.
> apache-manifoldcf-elastic-id.patch.gz:
>  The patch against the source already patched the other day.
>
> By the way, I tried to describe the above in some documents,
> but no suitable document was found in the ManifoldCF package.
> The Elasticsearch document may have been written for older specifications.
> Where can I describe these new specifications?
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp
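The "Hash if long" option described above can be sketched as follows: keep the raw URI as the Elasticsearch ID unless it would exceed the 512-byte limit, and substitute a SHA-1 hex digest otherwise.  This is an illustration under those assumptions, not the patch's actual code; the class and method names are invented:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of the "hash if long" ID strategy described above.
public class IdSketch {
    static final int MAX_ID_BYTES = 512; // Elasticsearch's document-ID limit

    static String idFor(String uri) {
        // Short enough: keep the raw URI for backward compatibility.
        if (uri.getBytes(StandardCharsets.UTF_8).length <= MAX_ID_BYTES) {
            return uri;
        }
        try {
            // Too long: fall back to a SHA-1 hex digest (40 chars, always
            // well under the 512-byte limit).
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(uri.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idFor("http://example.com/short"));
        System.out.println(idFor("http://example.com/" + "x".repeat(600)));
    }
}
```

Swapping `"SHA-1"` for `"SHA-256"` gives a 64-character hex ID instead, which is the trade-off discussed elsewhere in this thread.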


Re: Multiprocess file installation of manifold

2021-02-17 Thread Karl Wright
File synchronization is still supported but is deprecated.  We recommend
zookeeper synchronization unless you have a very good reason not to.

Karl


On Wed, Feb 17, 2021 at 12:26 PM Ananth Peddinti  wrote:

> Hello Team ,
>
>
> I would like to know if someone has already done a multi-process model
> installation of manifold on a linux machine.  I would like to know the
> process in detail.  We are running into issues with the quick-start model.
>
>
>
> Regards
>
> Ananth
> --
> 
> -SECURITY/CONFIDENTIALITY WARNING-
>
> This message and any attachments are intended solely for the individual or
> entity to which they are addressed. This communication may contain
> information that is privileged, confidential, or exempt from disclosure
> under applicable law (e.g., personal health information, research data,
> financial information). Because this e-mail has been sent without
> encryption, individuals other than the intended recipient may be able to
> view the information, forward it to others or tamper with the information
> without the knowledge or consent of the sender. If you are not the intended
> recipient, or the employee or person responsible for delivering the message
> to the intended recipient, any dissemination, distribution or copying of
> the communication is strictly prohibited. If you received the communication
> in error, please notify the sender immediately by replying to this message
> and deleting the message and any accompanying files from your system. If,
> due to the security risks, you do not wish to receive further
> communications via e-mail, please reply to this message and inform the
> sender that you do not wish to receive further e-mail from the sender.
> (LCP301)
> 
>


Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika is not memory bounded; some transformations stream, but
others put everything into memory.

You can try using the external tika, with a tika instance you run
separately, and that would likely help.  But you may need to give it lots
of memory too.

Karl


On Wed, Feb 17, 2021 at 3:50 AM ritika jain 
wrote:

> Hi Karl,
>
> I am using Elastic search as an output connector and yes using an internal
> Tika extracter, not using solr output connection.
>
> Also Elastic search server is on hosted on different server with huge
> memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright  wrote:
>
>> Hi, do you mean content limiter length of 100?
>>
>> I assume you are using the internal Tika transformer?  Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I described your setup above, because if you do
>> not use the Solr extract handler, the entire content of every document must
>> be loaded into memory.  That is why we require you to fill in a Solr field
>> on those kind of output connections that limits the number of bytes.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
>> wrote:
>>
>>>
>>>
>>> Hi users
>>>
>>>
>>> I am using the manifoldcf 2.14 Fileshare connector to crawl files from an
>>> smb server which has some millions (even billions) of records to process
>>> and crawl.
>>>
>>> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF in
>>> its start-options file.
>>>
>>> We have some larger files to crawl, around 30 MB or more.
>>>
>>> When the size specified in the content limiter tab is 10 (that is, 1 MB),
>>> the job works fine, but when it is changed to 1000 (that is, 10 MB),
>>> manifold crashes with some logs about a long-running query.
>>>
>>> How can we optimise the job specifications to process large documents as
>>> well?
>>>
>>> Do I need to increase or decrease the number of connections, or the worker
>>> thread count, or something else?
>>>
>>> Can anybody help me crawl larger files too, at least up to 10 MB?
>>>
>>> Thanks
>>>
>>> Ritika
>>>
>>


Re: Job Content Length issue

2021-02-16 Thread Karl Wright
Hi, do you mean content limiter length of 100?

I assume you are using the internal Tika transformer?  Are you combining
this with a Solr output connection that is not using the extract handler?

By "manifold crashes" I assume you actually mean it runs out of memory.
The "long running query" concern is a red herring because that does not
cause a crash under any circumstances.

This is quite likely if I described your setup above, because if you do not
use the Solr extract handler, the entire content of every document must be
loaded into memory.  That is why we require you to fill in a Solr field on
those kind of output connections that limits the number of bytes.
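The failure mode Karl describes (whole documents buffered in memory) is why a byte cap matters.  A minimal sketch of bounded reading, purely for illustration and not ManifoldCF's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: read at most maxBytes of a document stream instead
// of buffering the entire content in memory.
public class BoundedRead {
    static byte[] readAtMost(InputStream in, int maxBytes) {
        try {
            byte[] buf = new byte[maxBytes];
            int total = 0;
            while (total < maxBytes) {
                int n = in.read(buf, total, maxBytes - total);
                if (n < 0) break; // end of stream before hitting the cap
                total += n;
            }
            byte[] out = new byte[total];
            System.arraycopy(buf, 0, out, 0, total);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "a very large document body".getBytes(StandardCharsets.UTF_8);
        // Only the first 10 bytes are ever held, however big the document is.
        System.out.println(readAtMost(new ByteArrayInputStream(doc), 10).length); // prints 10
    }
}
```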

Karl


On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
wrote:

>
>
> Hi users
>
>
> I am using the manifoldcf 2.14 Fileshare connector to crawl files from an
> smb server which has some millions (even billions) of records to process and
> crawl.
>
> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF in
> its start-options file.
>
> We have some larger files to crawl, around 30 MB or more.
>
> When the size specified in the content limiter tab is 10 (that is, 1 MB),
> the job works fine, but when it is changed to 1000 (that is, 10 MB),
> manifold crashes with some logs about a long-running query.
>
> How can we optimise the job specifications to process large documents as
> well?
>
> Do I need to increase or decrease the number of connections, or the worker
> thread count, or something else?
>
> Can anybody help me crawl larger files too, at least up to 10 MB?
>
> Thanks
>
> Ritika
>


Re: content length tab

2021-02-15 Thread Karl Wright
This parameter is in bytes.
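Since the parameter is in bytes, common megabyte limits work out as follows (a quick arithmetic illustration only):

```java
// The Content Length parameter above is in bytes, so megabyte limits
// translate like this (using binary megabytes):
public class ContentLengthBytes {
    public static void main(String[] args) {
        System.out.println(1 * 1024 * 1024);  // 1 MB  = 1048576 bytes
        System.out.println(10 * 1024 * 1024); // 10 MB = 10485760 bytes
    }
}
```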

Karl


On Mon, Feb 15, 2021 at 9:03 AM ritika jain 
wrote:

> Hi Users,
>
> Can anybody tell me whether this value should be given in bytes or kilobytes?
>
> The "Content Length tab looks like this:
>
>
> [image: Windows Share Job, Content Length tab]
>
> If the value is filled in as 100, will this be 100 bytes, 100 kilobytes, or
> 100 MB?
>
> Thanks
> Ritika
>


Re: Job status stuck in terminating

2021-01-07 Thread Karl Wright
Hi,

Usually the reason a job doesn't complete is that a document is being retried
indefinitely.  You can see what's going on by looking at the Simple History
job report, or, if you prefer, by tailing the manifoldcf log.

Other times a job won't complete because somebody shut down the agents
process.  But that is not the answer for the simple example single-process
deployment.

Karl


On Thu, Jan 7, 2021 at 5:52 PM Isaac Kunz  wrote:

> I have a job that has been stuck in terminating for 12 hrs.  It is a small
> test job and I am wondering if there is a way to fix this?  The job ran once
> and completed 175k documents.  I modified the job's query and reseeded.  The
> job was modified to process a smaller document set.  I assume reseeding will
> allow the same documents to be indexed.  I do not need the metadata for this
> job, so if needed I could clear it via the db if I knew how.  I am a new user.
>
> Thanks,
>
> Isaac
> --
>


Re: Indexation Not OK

2021-01-01 Thread Karl Wright
Hi,
I don't have the ability to delete mail from mailing lists.  You have to
request Apache Infra do that.

Karl


On Thu, Dec 31, 2020 at 11:38 AM Michael Cizmar 
wrote:

> Ritika – We have had some discussions regarding Docker, etc.  The
> public image that is out there builds a single node and does not use an
> RDBMS.  I would not recommend using that to index billions of documents.
> You can turn on debugging in the connector and look at the logs to see if
> that traffic is actually going to Elasticsearch.
>
>
>
> Karl – I believe Ritika said Elastic.
>
>
>
>
>
> --
>
> Michael Cizmar
>
>
>
> *From:* ritika jain 
> *Sent:* Thursday, December 31, 2020 7:33 AM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Indexation Not OK
>
>
>
> Elastic search output connector with some custom changes for some fields
>
> On Thursday, December 31, 2020, Karl Wright  wrote:
>
> Hi,
> Can you let us know what you are using for the output connector?
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
> wrote:
>
> Hi,
>
>
>
> I am using Manifoldcf 2.14 and the JCIFS connector to ingest some billions
> of records into Elasticsearch.
>
> I am facing an issue in which, when the job is run, indexation succeeds for
> some time, but after a while manifoldcf loops over the records and
> indexation stops returning OK.
>
>
>
>
>
> and it keeps retrying those specific records; to get it going again I need
> to restart the docker container every time, and after the restart
> indexation works fine for those records too.
>
> I also checked that the JSON produced by the elastic search connector is
> well formed, which ensures the files themselves are not the problem.
>
> Can anybody please point me to the reason for this?
>
>
>
> Thanks
>
> Ritika
>
>
>
>
>
>


Re: Indexation Not OK

2020-12-31 Thread Karl Wright
Sorry, I couldn't quite understand everything in your email, but it sounds
like the problem is in the ES connection.  It is possible that ES expires
your connection and the indexing fails after that happens.  If that is
happening, however, I would expect to see a much more detailed error
message in both the logs and in the simple history.  Can you provide any
error messages from the log that seem to be coming from the output
connection?

Thanks,
Karl


On Thu, Dec 31, 2020 at 8:30 AM Karl Wright  wrote:

> Hi,
> Can you let us know what you are using for the output connector?
> Thanks,
> Karl
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
> wrote:
>
>> Hi,
>>
>> I am using Manifoldcf 2.14 and the JCIFS connector to ingest some billions
>> of records into Elasticsearch.
>> I am facing an issue in which, when the job is run, indexation succeeds for
>> some time, but after a while manifoldcf loops over the records and
>> indexation stops returning OK.
>>
>> [image: image.png]
>>
>> and it keeps retrying those specific records; to get it going again I need
>> to restart the docker container every time, and after the restart
>> indexation works fine for those records too.
>> I also checked that the JSON produced by the elastic search connector is
>> well formed, which ensures the files themselves are not the problem.
>> Can anybody please point me to the reason for this?
>>
>> Thanks
>> Ritika
>>
>>
>>


Re: Indexation Not OK

2020-12-31 Thread Karl Wright
Hi,
Can you let us know what you are using for the output connector?
Thanks,
Karl


On Thu, Dec 31, 2020 at 8:24 AM ritika jain 
wrote:

> Hi,
>
> I am using Manifoldcf 2.14 and the JCIFS connector to ingest some billions
> of records into Elasticsearch.
> I am facing an issue in which, when the job is run, indexation succeeds for
> some time, but after a while manifoldcf loops over the records and
> indexation stops returning OK.
>
> [image: image.png]
>
> and it keeps retrying those specific records; to get it going again I need
> to restart the docker container every time, and after the restart
> indexation works fine for those records too.
> I also checked that the JSON produced by the elastic search connector is
> well formed, which ensures the files themselves are not the problem.
> Can anybody please point me to the reason for this?
>
> Thanks
> Ritika
>
>
>


Re: Password admin UI

2020-12-17 Thread Karl Wright
There's no issue I know of, provided that the API rework a couple of years
ago didn't break something.

So I would create a ticket and we'll see if someone can look at it.

Karl


On Thu, Dec 17, 2020 at 8:47 AM  wrote:

> I should mention that I used the obfuscation method provided by
> org.apache.manifoldcf.core.system.ManifoldCF.obfuscate(String) and set the
> obfuscated password in the org.apache.manifoldcf.login.password.obfuscated
> and org.apache.manifoldcf.apilogin.password.obfuscated properties of the
> properties.xml file
>
>
>
> I can also guarantee you that I used UTF-8 encoding to provide the
> password to the obfuscate method and that testing the deobfuscate method
> provides the right password with UTF-8 chars
>
>
>
> Julien
>
>
>
> *De :* Karl Wright 
> *Envoyé :* mercredi 16 décembre 2020 19:40
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Password admin UI
>
>
>
> Hi Julien,
> The properties file is read as utf-8, so as long as you make sure that the
> encoding in your editor is utf-8, it should work.
>
> Many editors default to the Microsoft code page so use something like
> scite or emacs.
>
>
> Karl
>
>
>
> On Wed, Dec 16, 2020 at 12:31 PM  wrote:
>
> Hi,
>
>
>
> I tried different types of passwords for the admin UI and it appears that
> passwords containing accented characters or special characters do not work.
> Is it “normal” or not?
>
>
>
> Regards,
>
> Julien
>
>
>
>


Re: Password admin UI

2020-12-16 Thread Karl Wright
Hi Julien,
The properties file is read as utf-8, so as long as you make sure that the
encoding in your editor is utf-8, it should work.

Many editors default to the Microsoft code page so use something like scite
or emacs.
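The encoding point above can be verified directly: a value containing accented characters survives a write/read round trip only if both sides use UTF-8.  A minimal self-contained check (not ManifoldCF code; it just demonstrates the round trip with the standard library):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Quick illustration: write a value to a file in UTF-8 and read it back
// in UTF-8, confirming accented characters survive unchanged.
public class Utf8Check {
    static String roundTrip(String value) {
        try {
            Path p = Files.createTempFile("mcf-utf8", ".txt");
            Files.writeString(p, value, StandardCharsets.UTF_8);
            String back = Files.readString(p, StandardCharsets.UTF_8);
            Files.delete(p);
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints true: the accented password round-trips intact in UTF-8.
        System.out.println(roundTrip("mötdepassé").equals("mötdepassé"));
    }
}
```

If an editor had saved the file in a different code page, the bytes on disk would decode to a different string, which is the failure mode described above.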

Karl

On Wed, Dec 16, 2020 at 12:31 PM  wrote:

> Hi,
>
>
>
> I tried different types of passwords for the admin UI and it appears that
> passwords containing accented characters or special characters do not work.
> Is it “normal” or not?
>
>
>
> Regards,
>
> Julien
>
>
>


Re: Memory problem on Agent ?

2020-10-02 Thread Karl Wright
Please check your -Xmx switch.

Memory will not be released because that is not how Java works.  It
allocates the memory it needs and periodically garbage collects within
that.  You have given it too much memory and you should not expect Java to
release it ever.  The solution is to give it less.  A rule of thumb is to
leave 10gb free for system usage and divide the remainder among your Java
processes.
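One quick way to confirm what `-Xmx` actually handed the process is to ask the JVM itself; a minimal check:

```java
// Prints the maximum heap the running JVM will attempt to use, which
// reflects the -Xmx setting discussed above.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f GB%n",
                maxBytes / (1024.0 * 1024 * 1024));
    }
}
```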

Thanks,
Karl


On Fri, Oct 2, 2020 at 11:21 AM Bisonti Mario 
wrote:

> Yes, but it seems that, when the indexing finishes, the memory is not
> released
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* venerdì 2 ottobre 2020 17:14
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Memory problem on Agent ?
>
>
>
> Hi Mario,
>
> Java processes only use the memory you hand them.
>
>
> It looks like you are handing Java more memory than your machine has.
>
> This will not work.
>
>
> Karl
>
>
>
>
>
> On Fri, Oct 2, 2020 at 10:45 AM Bisonti Mario 
> wrote:
>
>
>
> Hallo.
>
>
>
> When I scan the content of the repository, I notice that the memory used is
> very high and it isn’t released
>
>
>
> i.e. 60GB on 70GB available
>
>
>
> I tried to free it by shutting down the agent, but I am not able to:
>
>
>
> /opt/manifoldcf/multiprocess-zk-example-proprietary/stop-agents.sh
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f4d5800, 34359738368, 0) failed; error='Not
> enough space' (errno=12)
>
> #
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
>
> # Native memory allocation (mmap) failed to map 34359738368 bytes for
> committing reserved memory.
>
> # An error report file with more information is saved as:
>
> # /opt/manifoldcf/multiprocess-zk-example-proprietary/hs_err_pid2796.log
>
>
>
> So, to free memory, I have to restart the server
>
> How could I solve this?
>
>
>
> Thanks a lot
>
> Mario
>
>
>
>


Re: Memory problem on Agent ?

2020-10-02 Thread Karl Wright
Hi Mario,

Java processes only use the memory you hand them.

It looks like you are handing Java more memory than your machine has.

This will not work.

Karl


On Fri, Oct 2, 2020 at 10:45 AM Bisonti Mario 
wrote:

>
>
> Hallo.
>
>
>
> When I scan the content of the repository, I notice that the memory used is
> very high and it isn’t released
>
>
>
> i.e. 60GB on 70GB available
>
>
>
> I tried to free it by shutting down the agent, but I am not able to:
>
>
>
> /opt/manifoldcf/multiprocess-zk-example-proprietary/stop-agents.sh
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f4d5800, 34359738368, 0) failed; error='Not
> enough space' (errno=12)
>
> #
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
>
> # Native memory allocation (mmap) failed to map 34359738368 bytes for
> committing reserved memory.
>
> # An error report file with more information is saved as:
>
> # /opt/manifoldcf/multiprocess-zk-example-proprietary/hs_err_pid2796.log
>
>
>
> So, to free memory, I have to restart the server
>
> How could I solve this?
>
>
>
> Thanks a lot
>
> Mario
>
>
>


I updated the site with the new release yesterday, but hasn't gone live

2020-09-18 Thread Karl Wright
Svnpubsub seems to be broken.
I sent email to infrastruct...@apache.org but apparently nobody reads that
anymore.  Stay tuned.

Karl


Re: Job interrupted

2020-08-24 Thread Karl Wright
Ok, I found the 'hard fail' situation.  Here is a patch to fix it:

Index:
connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
===
---
connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
 (revision 1881006)
+++
connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
 (working copy)
@@ -1349,7 +1349,7 @@
   Logging.connectors.warn("JCIFS: 'File in Use' response when
"+activity+" for "+documentIdentifier+": retrying...",se);
   // 'File in Use' skip the document and keep going
   throw new ServiceInterruption("Timeout or other service
interruption: "+se.getMessage(),se,currentTime + 30L,
-currentTime + 3 * 60 * 6L,-1,true);
+currentTime + 3 * 60 * 6L,-1,false);
 }
 else if (se.getMessage().indexOf("cannot find") != -1 ||
se.getMessage().indexOf("cannot be found") != -1)
 {

I'll commit to trunk as well.
Karl

On Mon, Aug 24, 2020 at 9:19 AM Karl Wright  wrote:

> Ok, then let me examine the code and see why it's not catching it.
> Karl
>
>
> On Mon, Aug 24, 2020 at 8:49 AM Bisonti Mario 
> wrote:
>
>> Yes, I see only that exception inside the manifoldcf.log and the job
>> stops with:
>>
>>
>>
>>
>>
>> Error: Repeated service interruptions - failure processing document: The
>> process cannot access the file because it is being used by another process.
>>
>>
>>
>>
>>
>> *Da:* Karl Wright 
>> *Inviato:* lunedì 24 agosto 2020 12:27
>> *A:* user@manifoldcf.apache.org
>> *Oggetto:* Re: Job interrupted
>>
>>
>>
>> Well, we look for certain kinds of exceptions from JCIFS and allow the
>> job to continue if we can't succeed.  You have to be sure though that the
>> failure was from *that* exception.  The reason I point that out is because
>> we have already a check for that, I believe.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Aug 24, 2020 at 5:55 AM Bisonti Mario 
>> wrote:
>>
>> Yes, but after I obtain:
>>
>>
>>
>> Error: Repeated service interruptions - failure processing document: The
>> process cannot access the file because it is being used by another process.
>>
>>
>>
>> And the job stops
>>
>>
>>
>>
>>
>> *Da:* Karl Wright 
>> *Inviato:* lunedì 24 agosto 2020 11:52
>> *A:* user@manifoldcf.apache.org
>> *Oggetto:* Re: Job interrupted
>>
>>
>>
>> Hi,
>> That's a warning.  The job will keep running and the document will be
>> retried later.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Aug 24, 2020 at 5:24 AM Bisonti Mario 
>> wrote:
>>
>> Hallo.
>>
>> I have some problems about job interrupted.
>>
>> The job execute a windows share scan
>>
>>
>>
>> After many errors, sometimes it stops
>>
>>
>>
>> I see in the manifoldcf.log many errors:
>>
>>
>>
>>
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
>> [mcf-jcifs-connector.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>>
>> WARN 2020-08-24T11:17:25,501 (Worker thread '59') - JCIFS: 'File in Use'
>> response when getting document version for smb://
>> fileserver.net/Workgroups/Dir/Dir2/finename.xlsx: retrying...
>>
>> jcifs.smb.SmbException: The process cannot access the file because it is
>> being used by another process.
>>
>> at
>> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
>> ~[jcifs-ng-2.1.2.jar:?]
>>
>> at
>> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
>> ~[jcifs-ng-2.1.2.jar:?]
>>
>> at
>> jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
>> ~[jcifs-ng-2.1.2.jar:?]
>>
>

Re: Job interrupted

2020-08-24 Thread Karl Wright
Ok, then let me examine the code and see why it's not catching it.
Karl


On Mon, Aug 24, 2020 at 8:49 AM Bisonti Mario 
wrote:

> Yes, I see only that exception inside the manifoldcf.log and the job stops
> with:
>
>
>
>
>
> Error: Repeated service interruptions - failure processing document: The
> process cannot access the file because it is being used by another process.
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* lunedì 24 agosto 2020 12:27
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job interrupted
>
>
>
> Well, we look for certain kinds of exceptions from JCIFS and allow the job
> to continue if we can't succeed.  You have to be sure though that the
> failure was from *that* exception.  The reason I point that out is because
> we have already a check for that, I believe.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 24, 2020 at 5:55 AM Bisonti Mario 
> wrote:
>
> Yes, but after I obtain:
>
>
>
> Error: Repeated service interruptions - failure processing document: The
> process cannot access the file because it is being used by another process.
>
>
>
> And the job stops
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* lunedì 24 agosto 2020 11:52
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job interrupted
>
>
>
> Hi,
> That's a warning.  The job will keep running and the document will be
> retried later.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 24, 2020 at 5:24 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
> I have some problems about job interrupted.
>
> The job execute a windows share scan
>
>
>
> After many errors, sometimes it stops
>
>
>
> I see in the manifoldcf.log many errors:
>
>
>
>
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,501 (Worker thread '59') - JCIFS: 'File in Use'
> response when getting document version for smb://
> fileserver.net/Workgroups/Dir/Dir2/finename.xlsx: retrying...
>
> jcifs.smb.SmbException: The process cannot access the file because it is
> being used by another process.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1747)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1716)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1710)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.queryPath(SmbFile.java:763)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.exists(SmbFile.java:844)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2188)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,502 (Worker thread '59') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Timeout or other
> service interruption: The process cannot access the file because it is
> being used by another process.
>
>
>
>
>
> What  could I check?
>
>
>
> Thanks a lot
>
> Mario
>
>


Re: Job interrupted

2020-08-24 Thread Karl Wright
Well, we look for certain kinds of exceptions from JCIFS and allow the job
to continue if we can't succeed.  You have to be sure though that the
failure was from *that* exception.  The reason I point that out is because
we have already a check for that, I believe.

Karl
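The pattern Karl describes — treating only specific JCIFS failures as retryable — can be sketched in isolation. The following is a hedged, self-contained illustration, not ManifoldCF's actual SharedDriveConnector code; the NT status constant is the real STATUS_SHARING_VIOLATION value, while the class and method names are hypothetical:

```java
public class JcifsRetryClassifier {
    // Real NTSTATUS value for "file in use" (STATUS_SHARING_VIOLATION).
    static final int NT_STATUS_SHARING_VIOLATION = 0xC0000043;

    /**
     * Decide whether a JCIFS failure should be retried later (a service
     * interruption) instead of failing the document permanently.
     */
    static boolean isRetryable(int ntStatus, String message) {
        if (ntStatus == NT_STATUS_SHARING_VIOLATION)
            return true;
        // Fall back to message matching when no status code is available.
        return message != null
            && message.contains("being used by another process");
    }

    public static void main(String[] args) {
        // A sharing violation is transient: retry the document later.
        System.out.println(isRetryable(NT_STATUS_SHARING_VIOLATION, null));
        // An access-denied failure is permanent: do not retry.
        System.out.println(isRetryable(0, "Access is denied."));
    }
}
```

A connector built this way would map a `true` result to a retry signal (ManifoldCF uses `ServiceInterruption` for this) and a `false` result to a permanent failure — which is why, as Karl notes, it matters that the failure really is *that* exception.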


On Mon, Aug 24, 2020 at 5:55 AM Bisonti Mario 
wrote:

> Yes, but after I obtain:
>
>
>
> Error: Repeated service interruptions - failure processing document: The
> process cannot access the file because it is being used by another process.
>
>
>
> And the job stops
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Monday, 24 August 2020 11:52
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Job interrupted
>
>
>
> Hi,
> That's a warning.  The job will keep running and the document will be
> retried later.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 24, 2020 at 5:24 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
> I have some problems with the job being interrupted.
>
> The job executes a Windows share scan.
>
>
>
> After many errors, sometimes it stops
>
>
>
> I see in the manifoldcf.log many errors:
>
>
>
>
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,501 (Worker thread '59') - JCIFS: 'File in Use'
> response when getting document version for smb://
> fileserver.net/Workgroups/Dir/Dir2/finename.xlsx: retrying...
>
> jcifs.smb.SmbException: The process cannot access the file because it is
> being used by another process.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1747)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1716)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1710)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.queryPath(SmbFile.java:763)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.exists(SmbFile.java:844)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2188)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,502 (Worker thread '59') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Timeout or other
> service interruption: The process cannot access the file because it is
> being used by another process.
>
>
>
>
>
> What  could I check?
>
>
>
> Thanks a lot
>
> Mario
>
>


Re: Job interrupted

2020-08-24 Thread Karl Wright
Hi,
That's a warning.  The job will keep running and the document will be
retried later.

Karl


On Mon, Aug 24, 2020 at 5:24 AM Bisonti Mario 
wrote:

> Hallo.
>
> I have some problems with the job being interrupted.
>
> The job executes a Windows share scan.
>
>
>
> After many errors, sometimes it stops
>
>
>
> I see in the manifoldcf.log many errors:
>
>
>
>
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,501 (Worker thread '59') - JCIFS: 'File in Use'
> response when getting document version for smb://
> fileserver.net/Workgroups/Dir/Dir2/finename.xlsx: retrying...
>
> jcifs.smb.SmbException: The process cannot access the file because it is
> being used by another process.
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbSessionImpl.send(SmbSessionImpl.java:409)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeImpl.send(SmbTreeImpl.java:472)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send0(SmbTreeConnection.java:399)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:314)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeConnection.send(SmbTreeConnection.java:294)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:130)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbTreeHandleImpl.send(SmbTreeHandleImpl.java:117)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1747)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1716)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.withOpen(SmbFile.java:1710)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.queryPath(SmbFile.java:763)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at jcifs.smb.SmbFile.exists(SmbFile.java:844)
> ~[jcifs-ng-2.1.2.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2188)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:610)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2020-08-24T11:17:25,502 (Worker thread '59') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Timeout or other
> service interruption: The process cannot access the file because it is
> being used by another process.
>
>
>
>
>
> What  could I check?
>
>
>
> Thanks a lot
>
> Mario
>
>


Re: How to reset job status

2020-08-19 Thread Karl Wright
So Mario,

First it appears that you mysteriously cannot build where everyone else
can.  Now you are having mysterious problems with ManifoldCF being able to
do basic state transitions.  I'm unable to reproduce any of these things.
More worrisome, you seem to have the opinion that rather than fix
underlying deployment or infrastructure issues, the right solution is just
to hack away at the database or the code.

This doesn't work for me.

I'd like to help you out here but there's a basic level of cooperation
needed for that.  The way you do deployments in ManifoldCF that we know
will be successful is by starting with one of the distribution examples and
(if needed) modifying that to meet your individual needs.  If you are
having bizarre things take place, almost always it's because you didn't
start with one of the examples and therefore you wound up configuring
things in a bizarre way.  So if you cannot get past your current problem, I
STRONGLY recommend you start over:

- Checkout a new copy of trunk and build following the instructions I gave
in the other email thread.  Follow them to the letter please.
- Pick your deployment model.
- Point it at your database instance.
- Start it USING THE SCRIPTS PROVIDED.

Your problems should resolve.  If not, you should have logging in
manifoldcf.log telling you what is going wrong.

Karl


On Wed, Aug 19, 2020 at 6:40 AM Karl Wright  wrote:

> You do not see log output.  Therefore I need to ask you some questions.
>
> What deployment model are you using?  single process or multi-process?
> what is the synchronization method?
>
>
> On Wed, Aug 19, 2020 at 6:38 AM Karl Wright  wrote:
>
>> Usually when you shut down the agents process (or the whole thing) and
>> restart it will fix problems like that UNLESS the problem persists because
>> a step in the state flow is failing.  If it is failing you would see log
>> output.  Do you see log output?
>>
>> Karl
>>
>>
>> On Wed, Aug 19, 2020 at 5:40 AM Bisonti Mario 
>> wrote:
>>
>>> No, I don’t have a notification connector, but it isn’t the problem.
>>>
>>> Manifoldcf.log is empty
>>>
>>>
>>>
>>> The problem is that the job is stuck in a hanging state and I would like
>>> to reset its state.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Karl Wright
>>> *Sent:* Wednesday, 19 August 2020 11:31
>>> *To:* user@manifoldcf.apache.org
>>> *Subject:* Re: How to reset job status
>>>
>>>
>>>
>>> There should be output in your manifoldcf.log file, no?  This may be the
>>> result of you not having a notification connector's code actually
>>> registered so you get no class found errors.  The only solution is to put
>>> the missing jar in place and restart your agents process.  Have a look at
>>> the log to confirm.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 19, 2020 at 4:56 AM Bisonti Mario 
>>> wrote:
>>>
>>> Hallo
>>>
>>> I have a job in a status “End notification” that hangs on this state.
>>>
>>>
>>>
>>> Is there a way to reset it?
>>>
>>>
>>>
>>> I tried the script lock-clean.sh without effect.
>>>
>>>
>>>
>>> In this state I am not able to manage jobs.
>>>
>>>
>>>
>>>
>>> What could I try, please?
>>>
>>>
>>>
>>>
>>> Thanks a lot
>>>
>>> Mario
>>>
>>>


Re: How to reset job status

2020-08-19 Thread Karl Wright
You do not see log output.  Therefore I need to ask you some questions.

What deployment model are you using?  single process or multi-process?
what is the synchronization method?


On Wed, Aug 19, 2020 at 6:38 AM Karl Wright  wrote:

> Usually when you shut down the agents process (or the whole thing) and
> restart it will fix problems like that UNLESS the problem persists because
> a step in the state flow is failing.  If it is failing you would see log
> output.  Do you see log output?
>
> Karl
>
>
> On Wed, Aug 19, 2020 at 5:40 AM Bisonti Mario 
> wrote:
>
>> No, I don’t have a notification connector, but it isn’t the problem.
>>
>> Manifoldcf.log is empty
>>
>>
>>
>> The problem is that the job is stuck in a hanging state and I would like
>> to reset its state.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright
>> *Sent:* Wednesday, 19 August 2020 11:31
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: How to reset job status
>>
>>
>>
>> There should be output in your manifoldcf.log file, no?  This may be the
>> result of you not having a notification connector's code actually
>> registered so you get no class found errors.  The only solution is to put
>> the missing jar in place and restart your agents process.  Have a look at
>> the log to confirm.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Aug 19, 2020 at 4:56 AM Bisonti Mario 
>> wrote:
>>
>> Hallo
>>
>> I have a job in a status “End notification” that hangs on this state.
>>
>>
>>
>> Is there a way to reset it?
>>
>>
>>
>> I tried the script lock-clean.sh without effect.
>>
>>
>>
>> In this state I am not able to manage jobs.
>>
>>
>>
>>
>> What could I try, please?
>>
>>
>>
>>
>> Thanks a lot
>>
>> Mario
>>
>>


Re: How to reset job status

2020-08-19 Thread Karl Wright
Usually when you shut down the agents process (or the whole thing) and
restart it will fix problems like that UNLESS the problem persists because
a step in the state flow is failing.  If it is failing you would see log
output.  Do you see log output?

Karl


On Wed, Aug 19, 2020 at 5:40 AM Bisonti Mario 
wrote:

> No, I don’t have a notification connector, but it isn’t the problem.
>
> Manifoldcf.log is empty
>
>
>
> The problem is that the job is stuck in a hanging state and I would like
> to reset its state.
>
>
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Wednesday, 19 August 2020 11:31
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to reset job status
>
>
>
> There should be output in your manifoldcf.log file, no?  This may be the
> result of you not having a notification connector's code actually
> registered so you get no class found errors.  The only solution is to put
> the missing jar in place and restart your agents process.  Have a look at
> the log to confirm.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Aug 19, 2020 at 4:56 AM Bisonti Mario 
> wrote:
>
> Hallo
>
> I have a job in a status “End notification” that hangs on this state.
>
>
>
> Is there a way to reset it?
>
>
>
> I tried the script lock-clean.sh without effect.
>
>
>
> In this state I am not able to manage jobs.
>
>
>
>
> What could I try, please?
>
>
>
>
> Thanks a lot
>
> Mario
>
>


Re: How to reset job status

2020-08-19 Thread Karl Wright
There should be output in your manifoldcf.log file, no?  This may be the
result of you not having a notification connector's code actually
registered so you get no class found errors.  The only solution is to put
the missing jar in place and restart your agents process.  Have a look at
the log to confirm.

Karl


On Wed, Aug 19, 2020 at 4:56 AM Bisonti Mario 
wrote:

> Hallo
>
> I have a job in a status “End notification” that hangs on this state.
>
>
>
> Is there a way to reset it?
>
>
>
> I tried the script lock-clean.sh without effect.
>
>
>
> In thise state I am not able to manage jobs.
>
>
>
>
> What could I try, please?
>
>
>
>
> Thanks a lot
>
> Mario
>
>


Re: WebCrawler Connector code

2020-07-07 Thread Karl Wright
Hi Ritika,

For performance reasons, you do not want to load the list of seeds every
time a document is processed.  This is partly why the connector API does
not support accessing arbitrary job data.

You should NEVER be calling JobManager methods from a connector either.
You have *Activity methods that you can call.

Karl


On Tue, Jul 7, 2020 at 4:04 AM ritika jain  wrote:

> Hi  Karl,
>
> Many thanks for your response.!!
>
> The problem I faced is getting the current job ID, which is why I used the
> JobStatus class. The other thing is getting the seeds corresponding to the
> running job ID.
>
> The activities object has the job ID set in its constructor, but there is
> no getter defined, so there is no way to read it in
> WebCrawlerConnector.java.
>
> Also, JobManager has a getAllSeeds function that is declared in its
> interface IJobManager but not defined in the implementation class
> JobManager, so it always returns an empty value.
>
> Thanks
>
>
> On Mon, Jul 6, 2020 at 6:44 PM Karl Wright  wrote:
>
>> Hi Ritika,
>>
>> ' My requirement is to abort a job whenever a seed-corresponding site is
>> down or returning some 5xx response codes. '
>>
>> (1) Connector methods, like addSeedDocuments(), are called by the
>> framework.  You do not call them yourself when you write a connector.  So
>> you are looking in the wrong place here.
>> (2) All that addSeedDocuments does in the web connector is add seed URLs
>> to the queue for the job.  You do not want to change this implementation.
>> (3) The only time the web connector fetches anything is when it is
>> processing documents, in the processDocuments() method.
>> (4) You don't get to control the queue.  Documents are processed by the
>> framework in the order *it* determines they should be processed.  You can
>> create an "event" which must be satisfied before processing can occur but
>> that is all the control you get at the connector level.
>> (5) Similarly, you don't get told which document URLs are seeds.  This
>> information is in the job, and it is included in the job queue "isSeed"
>> field for each document, but it is never sent to any connector method.
>>
>> It is therefore possible to add "isSeed" to the IRepositoryConnector
>> processDocuments() method, which will change the contract for all
>> connectors.  You might be able to prevent carnage by creating a
>> BaseRepositoryConnector method implementation and abstract method that
>> would provide a shim for most connectors.
>>
>> Karl
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 6, 2020 at 8:52 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am confused about the WebCrawler connector code. My requirement is to
>>> abort a job whenever a seed's corresponding site is down or returning
>>> 5xx response codes.
>>> So I have used the jobManager errorAbort method for this in the
>>> addSeedDocuments method of WebCrawlerConnector.java, and the JobStatus
>>> class to get a job ID.
>>>
>>> My confusion is how to get all the seeds corresponding to a given job ID,
>>> so I used the getAllSeeds() method declared in the IJobManager class.
>>>
>>> The getAllSeeds method, when used, always returns a zero-length array. I
>>> suspect this is because the method has no corresponding definition in its
>>> implementation class.
>>> *Why has this method not been implemented in its implementation class
>>> JobManager?*
>>>
>>> *Code done is:-*
>>> String[]
>>> array1=jobManager.getAllSeeds(Long.parseLong(jsr[k].getJobID()));
>>> array1 always comes back as an empty array.
>>>
>>> *Another query:*
>>> public String addSeedDocuments(ISeedingActivity activities,
>>> Specification spec,
>>> String lastSeedVersion, long seedTime, int jobMode)
>>> throws ManifoldCFException, ServiceInterruption
>>>
>>> The activities object holds the job ID of the job that calls this
>>> addSeedDocuments method, but neither the interface nor its implementation
>>> class provides a getter for the job ID (it is set in the constructor
>>> only).
>>>
>>>
>>> Can anybody please guide me on this.
>>>
>>> Thanks
>>> Ritika
>>>
>>>
>>>
>>>


Re: WebCrawler Connector code

2020-07-06 Thread Karl Wright
Hi Ritika,

' My requirement is to abort a job whenever a seed-corresponding site is
down or returning some 5xx response codes. '

(1) Connector methods, like addSeedDocuments(), are called by the
framework.  You do not call them yourself when you write a connector.  So
you are looking in the wrong place here.
(2) All that addSeedDocuments does in the web connector is add seed URLs to
the queue for the job.  You do not want to change this implementation.
(3) The only time the web connector fetches anything is when it is
processing documents, in the processDocuments() method.
(4) You don't get to control the queue.  Documents are processed by the
framework in the order *it* determines they should be processed.  You can
create an "event" which must be satisfied before processing can occur but
that is all the control you get at the connector level.
(5) Similarly, you don't get told which document URLs are seeds.  This
information is in the job, and it is included in the job queue "isSeed"
field for each document, but it is never sent to any connector method.

It is therefore possible to add "isSeed" to the IRepositoryConnector
processDocuments() method, which will change the contract for all
connectors.  You might be able to prevent carnage by creating a
BaseRepositoryConnector method implementation and abstract method that
would provide a shim for most connectors.

Karl
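Karl's points (3)–(5) imply that the 5xx handling Ritika wants belongs in processDocuments, expressed per document, rather than by reaching into JobManager. Here is a minimal, self-contained sketch of that per-document classification; all names are hypothetical illustrations, not the real connector API:

```java
public class FetchDecision {
    enum Action { INDEX, RETRY_LATER, SKIP }

    /** Classify one fetched document by its HTTP status code. */
    static Action decide(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300)
            return Action.INDEX;       // success: hand the content to the output connector
        if (httpStatus >= 500)
            return Action.RETRY_LATER; // 5xx is transient: signal a service interruption
        return Action.SKIP;            // 4xx etc. is permanent: drop the document
    }

    public static void main(String[] args) {
        System.out.println(decide(200)); // INDEX
        System.out.println(decide(503)); // RETRY_LATER
    }
}
```

In a real connector, RETRY_LATER would become a thrown `ServiceInterruption`; as the "Job interrupted" thread above shows, enough repeated service interruptions cause the framework itself to stop the job, which approximates abort-on-outage behavior without calling JobManager methods from a connector.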






On Mon, Jul 6, 2020 at 8:52 AM ritika jain  wrote:

> Hi All,
>
> I am confused about the WebCrawler connector code. My requirement is to
> abort a job whenever a seed's corresponding site is down or returning
> 5xx response codes.
> So I have used the jobManager errorAbort method for this in the
> addSeedDocuments method of WebCrawlerConnector.java, and the JobStatus
> class to get a job ID.
>
> My confusion is how to get all the seeds corresponding to a given job ID,
> so I used the getAllSeeds() method declared in the IJobManager class.
>
> The getAllSeeds method, when used, always returns a zero-length array. I
> suspect this is because the method has no corresponding definition in its
> implementation class.
> *Why has this method not been implemented in its implementation class
> JobManager?*
>
> *Code done is:-*
> String[]
> array1=jobManager.getAllSeeds(Long.parseLong(jsr[k].getJobID()));
> array1 always comes back as an empty array.
>
> *Another query:*
> public String addSeedDocuments(ISeedingActivity activities, Specification
> spec,
> String lastSeedVersion, long seedTime, int jobMode)
> throws ManifoldCFException, ServiceInterruption
>
> The activities object holds the job ID of the job that calls this
> addSeedDocuments method, but neither the interface nor its implementation
> class provides a getter for the job ID (it is set in the constructor
> only).
>
>
> Can anybody please guide me on this.
>
> Thanks
> Ritika
>
>
>
>


Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
The Sharepoint.dll would allow me to do the build, yes.  I'll email you
directly if you want to send it to me via google docs or some such.

Karl


On Wed, Jun 10, 2020 at 10:10 AM Shelly Singh 
wrote:

> Hi,
>
> Thanks for your response.
> Building the plugin and changing the code if and when needed could be
> tricky for me. I am only using ManifoldCF as a black box and am not
> familiar with the code at all. I would really appreciate it if you could
> add that support. I will also try to find someone who can do this, but the
> chances of success by that route are bleak.
>
> If it would help, I could share the Sharepoint.dll from 2019 sharepoint.
>
> Thanks!
> Shelly
>
>
>
> On 2020/06/10 05:48:01, Karl Wright  wrote:
> > Hi,
> > One is not available yet.  In order to build one I need a copy of the
> > Sharepoint.dll from a Sharepoint 2019 instance and some time.
> >
> > Karl
> >
> >
> > On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
> > wrote:
> >
> > > I am looking for Sharepoint 2019 plugin. Is one available?
> > >
> >
>


Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
Forgot the svn path:

https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2019/trunk

Karl


On Wed, Jun 10, 2020 at 2:07 AM Karl Wright  wrote:

> I've set up an svn path for this plugin.  If you can "svn co" this path it
> should give you the plugin source (no different from 2016 I hope) plus all
> the necessary instructions for building.  So, theoretically, you could
> build it yourself, BUT:
>
> - you need to find the version number of the sharepoint DLL and put it
> into the appropriate file before you can do that, and
> - if there are any Sharepoint API method signature changes there may need
> to be code changes, and
> - access to SharePoint via web services is deprecated and I have no idea
> if 2019 even contains it.
>
> Microsoft wants people to use REST access now, and that requires a
> complete redevelopment of the connector, something that's too massive to
> undertake myself.  Co-development might be possible; I did begin work two
> years ago and would have to dust that off.  If you are able to build the
> plugin, on the other hand, chances are good it will just work.  If you
> would like me to build a distribution release, I will need the DLL in order
> to be able to do that.
>
> Karl
>
>
> On Wed, Jun 10, 2020 at 1:48 AM Karl Wright  wrote:
>
>> Hi,
>> One is not available yet.  In order to build one I need a copy of the
>> Sharepoint.dll from a Sharepoint 2019 instance and some time.
>>
>> Karl
>>
>>
>> On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
>> wrote:
>>
>>> I am looking for Sharepoint 2019 plugin. Is one available?
>>>
>>


Re: Sharepoint 2019

2020-06-10 Thread Karl Wright
I've set up an svn path for this plugin.  If you can "svn co" this path it
should give you the plugin source (no different from 2016 I hope) plus all
the necessary instructions for building.  So, theoretically, you could
build it yourself, BUT:

- you need to find the version number of the sharepoint DLL and put it into
the appropriate file before you can do that, and
- if there are any Sharepoint API method signature changes there may need
to be code changes, and
- access to SharePoint via web services is deprecated and I have no idea if
2019 even contains it.

Microsoft wants people to use REST access now, and that requires a complete
redevelopment of the connector, something that's too massive to undertake
myself.  Co-development might be possible; I did begin work two years ago
and would have to dust that off.  If you are able to build the plugin, on
the other hand, chances are good it will just work.  If you would like me
to build a distribution release, I will need the DLL in order to be able to
do that.

Karl


On Wed, Jun 10, 2020 at 1:48 AM Karl Wright  wrote:

> Hi,
> One is not available yet.  In order to build one I need a copy of the
> Sharepoint.dll from a Sharepoint 2019 instance and some time.
>
> Karl
>
>
> On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
> wrote:
>
>> I am looking for Sharepoint 2019 plugin. Is one available?
>>
>


Re: Sharepoint 2019

2020-06-09 Thread Karl Wright
Hi,
One is not available yet.  In order to build one I need a copy of the
Sharepoint.dll from a Sharepoint 2019 instance and some time.

Karl


On Wed, Jun 10, 2020 at 1:30 AM Shelly Singh 
wrote:

> I am looking for Sharepoint 2019 plugin. Is one available?
>


Re: Window shares job-Error ERROR: invalid byte sequence for encoding "UTF8": 0x00

2020-06-03 Thread Karl Wright
This is a Postgresql problem of some kind.  It could be the network
connection between your ManifoldCF process(es) and the Postgresql server.
If it's repeating I'd worry about it, otherwise it will recover.

Karl
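Beyond the network angle, this particular message usually means a crawled document's text or metadata contained embedded NUL characters, which PostgreSQL's UTF-8 text values reject. A common workaround — offered here as a hedged, standalone sketch, not something ManifoldCF does out of the box — is to strip NULs from a string before it reaches the database:

```java
public class NulStripper {
    /** Remove U+0000 characters, which PostgreSQL text values cannot contain. */
    static String stripNuls(String s) {
        return s == null ? null : s.replace("\u0000", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNuls("abc\u0000def")); // prints abcdef
    }
}
```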


On Wed, Jun 3, 2020 at 3:58 AM ritika jain  wrote:

> Hi All,
>
> I am using the Windows Shares connector, Elasticsearch as the output
> connector, and Postgres as the database in ManifoldCF 2.14.
>
> The job is to crawl almost 2 million (20 lakh) records.
> When checked logs got this error:-
>
>
> * Worker thread aborting and restarting due to database connection reset:
> Database exception: SQLException doing query (22021): ERROR: invalid byte
> sequence for encoding "UTF8":
> 0x00org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence
> for encoding "UTF8": 0x00*
>
> I have not customized the code yet. Can anybody please let me know what is
> causing this error and how we can get rid of it?
>
> Thanks
> Ritika
>
>
>


Re: Web connector login sequence

2020-06-02 Thread Karl Wright
Thanks for the followup.

If you could create a ticket for this, and supply as much information as
possible, I'll try to look at it and understand the implications for the
session login model as it stands today.  I believe I presumed that the same
URL wouldn't be used for wildly different kinds of things; as you say,
turning this into a list rather than a map may make all the difference.

Karl
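Karl's remark about a list rather than a map comes down to HashMap semantics: putting two rules under the same regex key silently keeps only the last one, while a list preserves both. A standalone demonstration (illustrative only, not ManifoldCF's actual data structure):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LoginRuleDemo {
    // Keying login-sequence rules by their URL regex: the second put()
    // for the same regex replaces the first, so the redirect rule is lost.
    static int mapRuleCount() {
        Map<String, String> rules = new HashMap<>();
        rules.put("other-site/cas/login", "redirect");
        rules.put("other-site/cas/login", "form"); // overwrites the redirect rule
        return rules.size();
    }

    // Storing the rules as an ordered list keeps both entries.
    static int listRuleCount() {
        List<Map.Entry<String, String>> rules = new ArrayList<>();
        rules.add(Map.entry("other-site/cas/login", "redirect"));
        rules.add(Map.entry("other-site/cas/login", "form")); // both rules survive
        return rules.size();
    }

    public static void main(String[] args) {
        System.out.println(mapRuleCount());  // prints 1: only the "form" rule remains
        System.out.println(listRuleCount()); // prints 2
    }
}
```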


On Tue, Jun 2, 2020 at 7:34 AM  wrote:

> Hi Karl,
>
>
>
> Thanks for your answer.
>
>
>
> The login sequence I configured was the problem, but not because some parts
> were missing; the main problem was that I entered the same regular
> expression to address two different login types: a login page and a
> redirect page.
> I did not check the code, but it seems that the connector saves the login
> sequence into a HashMap with the login regex as key. So my redirect rule
> “other-site\/cas\/login = redirect” was overridden by the form rule
> “other-site\/cas\/login = form”. This is why, in the debug log, the
> other-site 302 response was not recognized by the login sequence.
>
>
>
> I have modified the two rules so that the regex are different and it works
> !
>
>
>
> I hope my use case will help other people if they encounter the same
> problem.
>
>
>
> Note that the solution I implemented sounds to me more like a workaround
> than a solution. Let me explain: I was able to differentiate the regex
> rules by removing a letter in one of them:
> “other-site\/cas\/logi = redirect” vs “other-site\/cas\/login = form”. But
> this does not feel like a “clean” solution
>
>
>
> Regards,
> Julien
>
>
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Friday, 29 May 2020 22:32
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Web connector login sequence
>
>
>
> Hi Julien,
>
> The login sequence must include all parts of the login sequence, from
> initiation (the first 302 that you get when you load /site) all the way
> through to the last action that sets the cookie.  After the login sequence
> is completed, the /site URL will be fetched again.  If you need more than
> one fetch to set more than one cookie, ALL the fetches must match your
> description of the login sequence or it will abort early.  If the cookie
> gets set on a final redirection, be sure to include that redirection too.
>
>
>
> Karl
>
>
>
>
>
> On Fri, May 29, 2020 at 12:01 PM  wrote:
>
> Hi MCF community,
>
>
>
> I need some help with the configuration of a login sequence with the Web
> connector. Here is the login sequence on a web browser :
>
>
>
> GET site/
>
> 302 -> site/login
>
> 302 -> other-site/cas/login
>
> 401 other-site/cas/login
>
> POST other-site/cas/login (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/
>
>
>
> I tested the following conf :
>
>
>
> Session: site
>
>   site\/login = redirect
>
> other-site\/cas\/login = redirect
>
>   other-site\/cas\/login = form
>
>   username=john
>
>password=***
>
>
>
> This configuration works up to the form POST; after the form POST, the
> first cookie is correctly retrieved by the job, but then it ends up in an
> infinite loop. Here are the debug logs:
>
>
>
> ….
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: For
> https://other-site/cas/login, setting virtual host to other-site
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Got an HttpClient object
> after 1 ms.
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Post method for
> '/cas/login'
>
> …..
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Retrieving cookies...
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB:   Cookie '[version:
> 0]xx
>
> INFO 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: FETCH LOGIN|
> https://other-site/cas/login|1590764838416+31|302|0|
>
> DEBUG 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Document '
> https://other-site/cas/login' did not match expected form, link,
> redir

Re: Crawling / Indexation Query

2020-05-30 Thread Karl Wright
We can't.  You need to follow the instructions and send email to the
appropriate address, listed here:

http://manifoldcf.apache.org/en_US/mail.html

Karl


On Sat, May 30, 2020 at 4:40 PM Shashank Saurabh 
wrote:

> Please unsubscribe me from your mailing list.
>
> Thanks,
> Shashank
>
> On Thu, May 7, 2020 at 4:11 PM Karl Wright  wrote:
>
>> Hi,
>>
>> ManifoldCF is not a crawler, it's a synchronizer.  If robots.txt says not
>> to crawl something, then it will not be indexed.  If robots.txt is changed
>> to prohibit crawling of certain documents, then yes, those documents will
>> be removed from the index.
>>
>> But you can override the robots behavior in the document specification or
>> configuration, I believe.
>>
>> Karl
>>
>>
>> On Thu, May 7, 2020 at 6:27 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> Can anybody explain:
>>> If a URL was indexed, and afterwards a noindex tag was added - will that
>>> URL then be deleted from the index when it is visited again by the crawler?
>>>
>>>
>>> Say a URL previously had a meta tag requiring indexation and was
>>> present in the Elastic index, but then a noindex meta tag
>>> (<meta name="robots" content="noindex">)
>>> was added to the page design.
>>>
>>> Should it be deleted from the index when the ManifoldCF job crawls that
>>> URL again, or will the URL still be present in the index?
>>>
>>> Thanks
>>>
>>>
>>>
>>


Re: Web connector login sequence

2020-05-29 Thread Karl Wright
Hi Julien,

The login sequence must include all parts of the login sequence, from
initiation (the first 302 that you get when you load /site) all the way
through to the last action that sets the cookie.  After the login sequence
is completed, the /site URL will be fetched again.  If you need more than
one fetch to set more than one cookie, ALL the fetches must match your
description of the login sequence or it will abort early.  If the cookie
gets set on a final redirection, be sure to include that redirection too.

Karl
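The rule Karl states (every fetch in the sequence must match a configured rule, including any final redirection) can be sketched in Java. The hostnames and rules below are taken from Julien's example and are illustrative only, not the connector's real matching code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LoginSequenceCheck {
    // Hedged illustration: every fetched URL must match some rule regex,
    // or the sequence aborts early.
    static boolean matchesAnyRule(List<String> ruleRegexes, String url) {
        return ruleRegexes.stream()
            .anyMatch(r -> Pattern.compile(r).matcher(url).find());
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("site/login", "other-site/cas/login");
        List<String> fetches = Arrays.asList(
            "https://site/login",
            "https://other-site/cas/login",
            "https://site/login?param1=value",
            "https://site/");  // final redirection not covered by any rule
        for (String url : fetches) {
            System.out.println(url + " -> "
                + (matchesAnyRule(rules, url) ? "matched" : "no rule: abort"));
        }
    }
}
```

The last fetch matches no rule, which is the kind of uncovered redirection that makes a sequence abort early.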


On Fri, May 29, 2020 at 12:01 PM  wrote:

> Hi MCF community,
>
>
>
> I need some help with the configuration of a login sequence with the Web
> connector. Here is the login sequence on a web browser :
>
>
>
> GET site/
>
> 302 -> site/login
>
> 302 -> other-site/cas/login
>
> 401 other-site/cas/login
>
> POST other-site/cas/login (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/login?param1=value (set cookie)
>
> 302 -> site/
>
>
>
> I tested the following conf :
>
>
>
> Session: site
>
>   site\/login = redirect
>
> other-site\/cas\/login = redirect
>
>   other-site\/cas\/login = form
>
>   username=john
>
>password=***
>
>
>
> This configuration works up to the form POST; after the form POST, the
> first cookie is correctly retrieved by the job, but then it ends up in an
> infinite loop. Here are the debug logs:
>
>
>
> ….
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: For
> https://other-site/cas/login, setting virtual host to other-site
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Got an HttpClient object
> after 1 ms.
>
> DEBUG 2020-05-29T15:07:25,560 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Post method for
> '/cas/login'
>
> …..
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Retrieving cookies...
>
> DEBUG 2020-05-29T15:07:18,442 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB:   Cookie '[version:
> 0]xx
>
> INFO 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: FETCH LOGIN|
> https://other-site/cas/login|1590764838416+31|302|0|
> 
>
> DEBUG 2020-05-29T15:07:18,448 (Worker thread '11') -
> MCF|MCF-agent|apache.manifoldcf.connectors|WEB: Document '
> https://other-site/cas/login' did not match expected form, link,
> redirection, or content for sequence 'site'
>
> ….
>
>
>
> It seems that the redirection after the form POST is not considered by the
> job, but I don’t know why. After that, there is an infinite loop where the
> cookie is passed on the GET “site/login”, which redirects to
> “other-site/login”; but this time, when “other-site/login” gets the cookie
> in the request, it does not send a 302 redirect response code but a 200 OK.
>
>
>
> I don’t know why there is such behavior, and I would be glad to have your
> advice!
>
>
>
> Thanks for your help
>
>
>
> Julien
>
>
>


Re: URL Mapping

2020-05-28 Thread Karl Wright
That's a much better case for using the url mapper, yes.


On Thu, May 28, 2020 at 1:40 PM Michael Cizmar 
wrote:

> Right.  Another case that I'm exploring...crawling an internal site and
> wanting a load balanced url.  So you would crawl something like this:
>
> http://mystaging-server.myco.com/index.html
>
> and then want to change it to:
>
> https://www.myco.com/index.html
>
> Is that better for the url mapper?
>
>
>
> --
>
> Michael Cizmar
> Managing Director
>
> p: 312.585.6396
>
> d: 312.585.6286
> twitter: @michaelcizmar <http://twitter.com/michaelcizmar>
>
> http://www.mcplusa.com/
>
>
> The information contained in this communication is confidential, private,
> proprietary, or otherwise privileged and is intended only for the use of
> the addressee.  This e-mail is intended only for the person or entity to
> whom it is directed.  Unauthorized use, disclosure, distribution or copying
> is strictly prohibited and may be unlawful.  If you are not the intended
> recipient, please notify us immediately and permanently delete this e-mail
> and any attachments.
>
> --
> *From:* Karl Wright 
> *Sent:* Thursday, May 28, 2020 12:03 PM
> *To:* user@manifoldcf.apache.org 
> *Subject:* Re: URL Mapping
>
> Thanks!  It's far better to implement this than to try and hack it.  A
> general way of removing session information with regular expressions is
> probably not going to cut it either, so for now it's got to be in Java.
>
> Karl
>
>
> On Thu, May 28, 2020 at 12:47 PM Michael Cizmar <
> michael.ciz...@mcplusa.com> wrote:
>
> The "!ut" and then a bunch of session information is from Web Sphere
> Portal.  Some information about it here:
>
> https://books.google.com/books?id=bqAXnpmj5LwC=PA180=PA180=%22!ut%22+session+variables+websphere#v=onepage=%22!ut%22%20session%20variables%20websphere=false
>
> I'll look at making a change to the web crawler to suppor this like the BV
> and ASP.NET
>
> --
> *From:* Karl Wright 
> *Sent:* Thursday, May 28, 2020 11:41 AM
> *To:* user@manifoldcf.apache.org 
> *Subject:* Re: URL Mapping
>
> Hi,
>
> There are provisions in the URL canonicallization part of the world for
> removal of session information from the URL.  It only knows about some
> kinds of widely used sessions; java app server sessions, for example,
> Broadvision sessions, etc.  If you can convince me that your session
> information is (a) uniquely identifiable, and (b) commonly used, the proper
> approach is to incorporate session removal in this framework.  Please let
> me know.
>
> Karl
>
>
> On Thu, May 28, 2020 at 12:11 PM Michael Cizmar <
> michael.ciz...@mcplusa.com> wrote:
>
> I've got a really long url with a bunch of unnecessary session query
> string parameters.  I've been trying unsuccessfully to map it to the same
> url without the session.
>
> an example of the url below.  I thought I could do this:
>
> url map regular expression:
>
> (.*)\/!ut
>
> replacement configuration:
>
>
>
>
> So the go would be that the url be:
>
> http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/
>
> But the url gets rejected.
>
> Sample Crawl Url
>
>
> http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/rZHLTsMwEEV_hS6yjDx5OWZpdRFImzYCAYk3lZM6D5TYSWoqPh8HFu2GQhHejEeae-aOLmIoQ0zyY1tz3SrJu7lneLfdBtTxI1iRhzsMFEfrpZ_6AFFoBnIzAN88Cj_pXxBDrJR60A3KeS2kvimV1KZaMKhJ886C8U1pIeSkOtNM3Pz5QewO3IJG9WIGDGW7RzkB7hZFIWxyyx3bL8LAJo6L7QoELitMPAH7r4WXLefmpvBkOoqfiTHth6vYTRxIAT1eufMy8D74Z2DqXg2Mf5Fz-zqOjJq05nzeNcr-FpchuVOyTGpjkOvGbmWlUHYmQtmZCGWfoqF_6omHq83G5gUBL-iOa0oXiw9FOxLu/dl5/d5/L0lJS2FZcHBpbW1LYVlwcGltbVlwcGchIS9vSHd3QUFBSXdpRUFJSkRBQ1VZaUVJVTVCZ09DbFFBQUlBQVNvU0FyUnFBQURBQWF0QXdMTzlRQUFFQUJ3WWVBR0tTQUFDa0k1Z21HU3dTaXJTQUFDZ0s5ZzBIUS80SmlHcGhxRWFoR29ScUVhbEdwaC9aNl9PTzVBMTRHMEs4Ukg2MEE2R0xDNFA0MDBHNy9hZ2VudCBjb250ZW50JTBwb3J0YWwlMHF1b3RlZW5yb2xsJTBkaWdzIC0gcXVvdGluZyAgZW5yb2xsbWVudCAoaW5kaXZpZHVhbCkvZjQ0YmEyOWUtODQwOC00YjFlLTg4MzktMTFlMjI4NDgxYTVhL2RpZ3MgLSBxdW90aW5nICBlbnJvbGxtZW50IChpbmRpdmlkdWFsKQ
>
>
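Both mappings discussed in this thread can be expressed as simple regex rewrites. Here is a hedged Java sketch (the hostnames are hypothetical, and this uses plain `String.replaceAll`, not the url mapper's exact rule syntax):

```java
public class UrlMapDemo {
    public static void main(String[] args) {
        // 1) Rewrite a staging host to the load-balanced public host.
        String staging = "http://mystaging-server.myco.com/index.html";
        String mapped = staging.replaceAll(
            "^http://mystaging-server\\.myco\\.com/(.*)$",
            "https://www.myco.com/$1");
        System.out.println(mapped); // prints: https://www.myco.com/index.html

        // 2) Strip a WebSphere Portal "!ut" session blob from the path.
        String withSession = "http://host/mcplusa/myportal/page/!ut/p/a1/SESSIONSTATE";
        String stripped = withSession.replaceAll("/!ut/.*$", "/");
        System.out.println(stripped); // prints: http://host/mcplusa/myportal/page/
    }
}
```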


Re: Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-21 Thread Karl Wright
So the folder is accessible, but can you open the specific document
itself?  There may be an issue there unrelated to the folder.

If it does open OK, then I'm afraid you're beyond my knowledge of what the
problem might be.  The current JCIFS library comes from a Github project
and perhaps you can contact the maintainers to get them to interpret what
it means.  Sometimes just googling the precise error message (not
ManifoldCF's, but the underlying JCIFS error) can help clarify the issue.

Karl


On Thu, May 21, 2020 at 4:00 AM ritika jain 
wrote:

> Reply:-
> The smb exception means that it is coming from the JCIFS library, which is
> trying to find documents and their metadata from your windows shares, and
> is apparently not getting something it needs back promptly. Perhaps the
> user you are using to do the crawl has insufficient privileges? Also, the
> error you are seeing is a new one; I've never seen that before, so the
> connector hasn't either, and it basically doesn't know whether to skip the
> document or hard fail. But what I'd do is try to open the document yourself
> in Windows and find out whether it seems to work or not, for a start.
>
> Many thanks for your reply.
> I will follow the mail chain from now on.
> I have checked the user privileges: the user has all access rights, and
> manual access to the folders works fine; the folder is accessible.
> Could it be that the Windows Shares connector faces some problem while
> connecting (a network issue)?
>
> Thanks
> Ritika
>
> On Tue, May 19, 2020 at 2:39 PM Karl Wright  wrote:
>
>> I commented in the ticket you created.
>> Thanks,
>> Karl
>>
>> On Tue, May 19, 2020 at 3:07 AM ritika jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2 and Postgres
>>> 9.6.10) on a server to access files from a Samba SMBv3 server, and loaded
>>> jcifs-ng-2.1.2.jar into ManifoldCF's lib directory.
>>>
>>> After ingesting some records into the index, I got this error in the
>>> logs:
>>>  :-Unrecognized SmbException thrown getting document version for
>>> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE
>>> Data/morelis/VSS/MatlabTools/data/s/srca.a
>>> jcifs.smb.SmbException: Failed to acquire credits in time.
>>>
>>> Can anybody please help me understand what can be the possible cause of
>>> this error. Can it be a network connection issue or something else.
>>>
>>> For info:- no authority connection/ Active Directory is being used till
>>> now. Also the Use SID for security (checkbox on manifoldcf UI):- is
>>> UNCHECKED.
>>>
>>> Any help will be appreciated greatly.
>>>
>>> Thanks
>>> RItika
>>>
>>>
>>>
>>>
>>>
>>>


Re: Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-19 Thread Karl Wright
I commented in the ticket you created.
Thanks,
Karl

On Tue, May 19, 2020 at 3:07 AM ritika jain 
wrote:

> Hi All,
>
> I have configured a Units job (ManifoldCF 2.14, ES 7.6.2 and Postgres
> 9.6.10) on a server to access files from a Samba SMBv3 server, and loaded
> jcifs-ng-2.1.2.jar into ManifoldCF's lib directory.
>
> After ingesting some records into the index, I got this error in the logs:
>  :-Unrecognized SmbException thrown getting document version for
> smb://store1.directory.intra/folders/UnitsTag1/Hydraulic Engineering/13 HYE
> Data/morelis/VSS/MatlabTools/data/s/srca.a
> jcifs.smb.SmbException: Failed to acquire credits in time.
>
> Can anybody please help me understand what can be the possible cause of
> this error. Can it be a network connection issue or something else.
>
> For info:- no authority connection/ Active Directory is being used till
> now. Also the Use SID for security (checkbox on manifoldcf UI):- is
> UNCHECKED.
>
> Any help will be appreciated greatly.
>
> Thanks
> RItika
>
>
>
>
>
>


Re: Crawling / Indexation Query

2020-05-07 Thread Karl Wright
Hi,

ManifoldCF is not a crawler, it's a synchronizer.  If robots.txt says not to
crawl something, then it will not be indexed.  If robots.txt is changed to
prohibit crawling of certain documents, then yes, those documents will be
removed from the index.

But you can override the robots behavior in the document specification or
configuration, I believe.

Karl


On Thu, May 7, 2020 at 6:27 AM ritika jain  wrote:

> Hi All,
>
> Can anybody explain:
> If a URL was indexed, and afterwards a noindex tag was added - will that
> URL then be deleted from the index when it is visited again by the crawler?
>
>
> Say a URL previously had a meta tag requiring indexation and was
> present in the Elastic index, but then a noindex meta tag
> (<meta name="robots" content="noindex">)
> was added to the page design.
>
> Should it be deleted from the index when the ManifoldCF job crawls that
> URL again, or will the URL still be present in the index?
>
> Thanks
>
>
>


Re: ES 7.6.2

2020-05-07 Thread Karl Wright
Hi Ritika,

ManifoldCF's ElasticSearch connector does not include any code that
requires Java 11, so you are all set.

Because JDK 11 removes many packages, however, you should expect to run
ManifoldCF 2.14 with Java 8.  ManifoldCF 2.16, just released, supports Java
11.

Karl


On Thu, May 7, 2020 at 5:14 AM ritika jain  wrote:

> Hi,
>
> Can anybody please tell me whether ManifoldCF 2.14 is compatible with
> Elasticsearch 7.6.2, as the latter requires Java 11?
>
> Thanks
> Ritika
>


Re: Extraction of related links

2020-02-12 Thread Karl Wright
This is not functionality that ManifoldCF supports out of the box.  The
extracted links are used for crawling, not as metadata.

I don't see a general use-case for this either, so I think you're on your
own modifying the web connector code to do what you want.  The
RepositoryDocument structure has arbitrary multi-valued fields; just put
what you want into one such field and you should see it in Elastic Search.

Karl
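As a sketch of the suggestion above, the extracted sub-links could be collected and then handed over as one multi-valued field. The following plain-Java illustration (a naive regex link scan, not the web connector's real extraction logic; the URLs are the ones from the question) shows the collection step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubLinkDemo {
    // Hedged sketch: collect extracted links that are sub-links of the
    // document identifier, suitable for a multi-valued metadata field.
    static List<String> subLinks(String docId, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link.startsWith(docId) && !link.equals(docId)) out.add(link);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.xyz.com/abc\">a</a>"
            + "<a href=\"http://www.xyz.com/pqr\">b</a>"
            + "<a href=\"http://other.com/x\">c</a>";
        System.out.println(subLinks("http://www.xyz.com", html));
        // prints: [http://www.xyz.com/abc, http://www.xyz.com/pqr]
    }
}
```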


On Thu, Feb 13, 2020 at 1:57 AM ritika jain 
wrote:

> Hi All,
>
> I am using ManifoldCF 2.12, with the Web connector as repository and ES as
> output. As per a new requirement, I want to save all related sub-links of a
> particular document identifier at a time. For example, for the DocumentId
> www.xyz.com, I would like to extract all related sub-links, say
> www.xyz.com/abc, www.xyz.com/pqr, etc., save them in a variable, and then
> pass them to Elastic Search.
>
> I had gone through the Web repository connector code and thought the
> function extractLinks
> (protected boolean extractLinks(String documentIdentifier,
> IProcessActivity activities, DocumentURLFilter filter)) could do so.
> Is the existing functionality of MCF able to do this extraction, or do we
> have to customize it? Any help would be appreciated.
>
>
> Thanks
> Ritika
>


Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-30 Thread Karl Wright
Hi,

I've been waiting for a ticket to appear that summarizes what's been
happening for this issue but I haven't seen one.  Can you bring us up to
date?  Thanks in advance,
Karl

On Wed, Jan 22, 2020 at 12:06 PM Jörn Franke  wrote:

> I'll try to help. I will create a Jira, if you do not mind, where I also
> explain how I made WSDL fetching work for https, which could not have
> worked before.
> The reason is that for fetching the WSDL it uses a completely different
> approach (WSDL4Java), where no truststore was defined in the connector code
> (a truststore was only used for actually doing SOAP requests). This was
> fixed, and it can now create the service. But then, after service creation,
> the following lines of code fetch the xsds, which does not work (it says it
> cannot find them), which is strange as the URL is correct and reachable.
> Hence, I suspect it does not understand that it is an http URL but instead
> tries to open it on the file system.
> I am pretty busy at the moment, but I will try to support and give feedback
> as soon as possible.
>
> I don’t know what is different in your two installations; maybe they are
> http, or partly http, or there is some patch to the code that did not make
> it into the git repository.
> I can tell that I can test with Content Server 10 as well as 16.
> SOAP UI has no problem and in the end it does exactly the same (starting
> from the https WSDL etc.)
>
> Am 22.01.2020 um 13:17 schrieb Karl Wright :
>
> 
> The whole web services java + cxf architecture is pretty mysterious.  The
> only way I've made progress is by finding code snippets in stackoverflow;
> the documentation is not adequate.  BUT there are configuration files that
> determine how the WSDL parser resolves references.  I don't know how we
> would force that configuration to be in effect but something like that
> would need to be done.  I'm just surprised that you're having this problem
> when two other installations didn't.  There must be a difference somewhere.
>
> Karl
>
>
> On Wed, Jan 22, 2020 at 5:11 AM Jörn Franke  wrote:
>
>> Sorry, I did not have much time; my next step is to try to modify the
>> catalog xml to fetch it directly over https. For some reason it can fetch
>> the WSDL (after my fix), but not the included xsds, even though the error
>> message shows the correct URL for them.
>> Are you aware of any configuration that tries to force file-based access
>> of those? In the code I did not find anything suspicious.
>>
>> Am 22.01.2020 um 10:28 schrieb Karl Wright :
>>
>> 
>> Has there been any news?
>> I'd love to get this tied up so that you're able to proceed.
>> Karl
>>
>> On Thu, Jan 16, 2020 at 12:08 PM Jörn Franke 
>> wrote:
>>
>>> OK, I understand. I will try and let you know. Thanks again very much for
>>> your fast and detailed answer, really appreciated. I hope I can give back
>>> with the solution to fetch WSDLs from https, and maybe a solution to this
>>> problem (maybe others will face this as well).
>>>
>>> About the connector: the WSDL is successfully fetched via https (not
>>> file, no clue why) after the modification I made. The only problem I see
>>> now is that the xsds to which the WSDL refers are not fetched. The
>>> bizarre thing is that the https URL it mentions for the xsd is
>>> absolutely correct. So I assume it does not understand an http URL; maybe
>>> that is related to configuration.
>>>
>>> Am 16.01.2020 um 14:53 schrieb Karl Wright :
>>>
>>> 
>>> The WSDLs are bundled with the jar.  We intended this to be the ONLY way
>>> the wsdls were accessed, and made lots of changes to the wsdls accordingly,
>>> so that they referenced other wsdls via the "file system".  The wsdls are
>>> the fixed up ones that are used to build the java stubs locally, plus a
>>> config file that's supposed to tell CXF how to resolve referenced wsdls.
>>> That config file may or may not be correct, because we never were able to
>>> get CXF to use the local resource wsdls during actual connection.
>>>
>>> Except now they seem to be both fetched via https AND locally sourced.
>>> I have no idea how that can be.  I had assumed it was done one way or the
>>> other but not both.
>>>
>>> Perhaps the problem is that the configuration file is being read but the
>>> resource wsdls are not being found?  Removing the meta-inf from the jar
>>> would then force everything to go through https.  Ideally I'd love it if
>>> that wasn't needed and we could get the resource fetch working everywhere.
>>>
>

Re: sharepoint crawler documents limit

2020-01-27 Thread Karl Wright
I'm glad you got by this.  Thanks for letting us know what the issue was.
Karl

On Mon, Jan 27, 2020 at 4:05 AM Jorge Alonso Garcia 
wrote:

> Hi,
> We changed the timeout on the SharePoint IIS and now the process is able to
> crawl all documents.
> Thanks for your help
>
>
>
> El lun., 30 dic. 2019 a las 12:18, Gaurav G ()
> escribió:
>
>> We had faced a similar issue, wherein our repo had 100,000 documents but
>> our crawler stopped after 5 documents. The issue turned out to be that
>> the Sharepoint query that was fired by the Sharepoint web service gets
>> progressively slower and eventually the connection starts timing out before
>> the next 1 records get returned. We increased a timeout parameter on
>> Sharepoint to 10 minutes and then after that we were able to crawl all
>> documents successfully.  I believe we had increased the parameter indicated
>> in the link below
>>
>>
>> https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website
>>
>>
>>
>> On Fri, Dec 20, 2019 at 6:27 PM Karl Wright  wrote:
>>
>>> Hi Priya,
>>>
>>> This has nothing to do with anything in ManifoldCF.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Dec 20, 2019 at 7:56 AM Priya Arora  wrote:
>>>
>>>> Hi All,
>>>>
>>>> Is this issue something to have with below value/parameters set in
>>>> properties.xml.
>>>> [image: image.png]
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia 
>>>> wrote:
>>>>
>>>>> And what other SharePoint parameter could I check?
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>>
>>>>>
>>>>> El vie., 20 dic. 2019 a las 12:47, Karl Wright ()
>>>>> escribió:
>>>>>
>>>>>> The code seems correct and many people are using it without
>>>>>> encountering this problem.  There may be another SharePoint configuration
>>>>>> parameter you also need to look at somewhere.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <
>>>>>> jalon...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Karl,
>>>>>>> On SharePoint the list view threshold is 150,000, but we only receive
>>>>>>> 20,000 from MCF.
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>>
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> El jue., 19 dic. 2019 a las 19:19, Karl Wright ()
>>>>>>> escribió:
>>>>>>>
>>>>>>>> If the job finished without error it implies that the number of
>>>>>>>> documents returned from this one library was 1 when the service is
>>>>>>>> called the first time (starting at doc 0), 1 when it's called the
>>>>>>>> second time (starting at doc 1), and zero when it is called the 
>>>>>>>> third
>>>>>>>> time (starting at doc 2).
>>>>>>>>
>>>>>>>> The plugin code is unremarkable and actually gets results in chunks
>>>>>>>> of 1000 under the covers:
>>>>>>>>
>>>>>>>> SPQuery listQuery = new SPQuery();
>>>>>>>> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef
>>>>>>>> Name=\"FileRef\" /></OrderBy>";
>>>>>>>> listQuery.QueryThrottleMode =
>>>>>>>> SPQueryThrottleOption.Override;
>>>>>>>> listQuery.ViewAttributes =
>>>>>>>> "Scope=\"Recursive\"";
>>>>>>>> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>>>>>>> listQuery.RowLimit = 1000;
>>>>>>>>
>>>>>>>> XmlDocument doc = new XmlDocument();
>>>>>>>> 
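The chunked retrieval Karl describes (pages of RowLimit items until an empty page comes back) can be sketched in Java; this is a toy in-memory model of the paging loop, not the SharePoint plugin itself:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PagingDemo {
    // Toy model: return one page of up to pageSize items starting at start,
    // or an empty list once the source is exhausted.
    static List<Integer> fetchPage(List<Integer> all, int start, int pageSize) {
        if (start >= all.size()) return Collections.emptyList();
        return all.subList(start, Math.min(start + pageSize, all.size()));
    }

    public static void main(String[] args) {
        List<Integer> repo = new ArrayList<>();
        for (int i = 0; i < 2500; i++) repo.add(i);

        int fetched = 0, start = 0;
        while (true) {
            List<Integer> page = fetchPage(repo, start, 1000);
            if (page.isEmpty()) break;   // empty page signals completion
            fetched += page.size();
            start += page.size();
        }
        System.out.println(fetched); // prints: 2500
    }
}
```

The caller only knows it is done when a page comes back empty, which matches the three-call pattern described above (two non-empty pages, then an empty one).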

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-22 Thread Karl Wright
The whole web services java + cxf architecture is pretty mysterious.  The
only way I've made progress is by finding code snippets in stackoverflow;
the documentation is not adequate.  BUT there are configuration files that
determine how the WSDL parser resolves references.  I don't know how we
would force that configuration to be in effect but something like that
would need to be done.  I'm just surprised that you're having this problem
when two other installations didn't.  There must be a difference somewhere.

Karl
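One mechanism worth checking here (an assumption on my part, not something confirmed in this thread): JAX-WS and CXF honor an OASIS XML catalog at META-INF/jax-ws-catalog.xml, which can rewrite remote xsd references to resources bundled in the jar. The URL and resource paths below are placeholders:

```xml
<!-- Hypothetical META-INF/jax-ws-catalog.xml: redirect remote xsd lookups
     to copies bundled with the connector (URL and paths are placeholders). -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="https://server/cws/services/Authentication?xsd="
                 rewritePrefix="wsdls/Authentication_xsd_"/>
</catalog>
```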


On Wed, Jan 22, 2020 at 5:11 AM Jörn Franke  wrote:

> Sorry, I did not have much time; my next step is to try to modify the
> catalog xml to fetch it directly over https. For some reason it can fetch
> the WSDL (after my fix), but not the included xsds, even though the error
> message shows the correct URL for them.
> Are you aware of any configuration that tries to force file-based access
> of those? In the code I did not find anything suspicious.
>
> Am 22.01.2020 um 10:28 schrieb Karl Wright :
>
> 
> Has there been any news?
> I'd love to get this tied up so that you're able to proceed.
> Karl
>
> On Thu, Jan 16, 2020 at 12:08 PM Jörn Franke  wrote:
>
>> OK, I understand. I will try and let you know. Thanks again very much for
>> your fast and detailed answer, really appreciated. I hope I can give back
>> with the solution to fetch WSDLs from https, and maybe a solution to this
>> problem (maybe others will face this as well).
>>
>> About the connector: the WSDL is successfully fetched via https (not file,
>> no clue why) after the modification I made. The only problem I see now
>> is that the xsds to which the WSDL refers are not fetched. The bizarre
>> thing is that the https URL it mentions for the xsd is absolutely
>> correct. So I assume it does not understand an http URL; maybe that is
>> related to configuration.
>>
>> Am 16.01.2020 um 14:53 schrieb Karl Wright :
>>
>> 
>> The WSDLs are bundled with the jar.  We intended this to be the ONLY way
>> the wsdls were accessed, and made lots of changes to the wsdls accordingly,
>> so that they referenced other wsdls via the "file system".  The wsdls are
>> the fixed up ones that are used to build the java stubs locally, plus a
>> config file that's supposed to tell CXF how to resolve referenced wsdls.
>> That config file may or may not be correct, because we never were able to
>> get CXF to use the local resource wsdls during actual connection.
>>
>> Except now they seem to be both fetched via https AND locally sourced.  I
>> have no idea how that can be.  I had assumed it was done one way or the
>> other but not both.
>>
>> Perhaps the problem is that the configuration file is being read but the
>> resource wsdls are not being found?  Removing the meta-inf from the jar
>> would then force everything to go through https.  Ideally I'd love it if
>> that wasn't needed and we could get the resource fetch working everywhere.
>>
>>
>> Karl
>>
>>
>> On Thu, Jan 16, 2020 at 8:20 AM Jörn Franke  wrote:
>>
>>> Well, I am not sure how they solved it; I will share a tested solution
>>> in Jira and everyone can check. Maybe their wsdl is accessible through http?
>>>
>>> What works is doing the call through https, but the fetching of the WSDL
>>> did not work, as this goes through another mechanism.
>>>
>>> I don’t think that our OpenText installation is different; the WSDLs look
>>> very similar to the Repository one.
>>>
>>> The strange thing is that for this error message it tries to access the
>>> xsd through an https URL (which is perfectly accessible for the server).
>>> Could it be that the connector restricts itself somehow to local file
>>> system access only, or similar?
>>> Have you faced this issue before?
>>>
>>>
>>>
>>> Am 16.01.2020 um 12:56 schrieb Karl Wright :
>>>
>>> 
>>> I should say that we have (AFAICT) at least two independent
>>> installations of the csws connector working in the field, at least one of
>>> them using secure connections.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 16, 2020 at 6:54 AM Karl Wright  wrote:
>>>
>>>> We solved the WSDL fetching through HTTPS, or thought we had, by
>>>> restructuring the code according to a number of articles we found.  This
>>>> was supposedly tested and worked in one installation.  Nobody has ever
>>>> reported issues with the wsdls being fetched however; I worry that you may
>>>> have a 

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-22 Thread Karl Wright
Has there been any news?
I'd love to get this tied up so that you're able to proceed.
Karl

On Thu, Jan 16, 2020 at 12:08 PM Jörn Franke  wrote:

> OK, I understand. I will try and let you know. Thanks again very much for
> your fast and detailed answer, really appreciated. I hope I can give back
> with the solution to fetch WSDLs from https, and maybe a solution to this
> problem (maybe others will face this as well).
>
> About the connector: the WSDL is successfully fetched via https (not file,
> no clue why) after the modification I made. The only problem I see now
> is that the xsds to which the WSDL refers are not fetched. The bizarre
> thing is that the https URL it mentions for the xsd is absolutely
> correct. So I assume it does not understand an http URL; maybe that is
> related to configuration.
>
> Am 16.01.2020 um 14:53 schrieb Karl Wright :
>
> 
> The WSDLs are bundled with the jar.  We intended this to be the ONLY way
> the wsdls were accessed, and made lots of changes to the wsdls accordingly,
> so that they referenced other wsdls via the "file system".  The wsdls are
> the fixed up ones that are used to build the java stubs locally, plus a
> config file that's supposed to tell CXF how to resolve referenced wsdls.
> That config file may or may not be correct, because we never were able to
> get CXF to use the local resource wsdls during actual connection.
>
> Except now they seem to be both fetched via https AND locally sourced.  I
> have no idea how that can be.  I had assumed it was done one way or the
> other but not both.
>
> Perhaps the problem is that the configuration file is being read but the
> resource wsdls are not being found?  Removing the meta-inf from the jar
> would then force everything to go through https.  Ideally I'd love it if
> that wasn't needed and we could get the resource fetch working everywhere.
>
>
> Karl
>
>
> On Thu, Jan 16, 2020 at 8:20 AM Jörn Franke  wrote:
>
>> Well, I am not sure how they solved it; I will share a tested solution in
>> Jira and everyone can check. Maybe their wsdl is accessible through http?
>>
>> What works is doing the call through https, but the fetching of the WSDL
>> did not work, as this goes through another mechanism.
>>
>> I don’t think that our OpenText installation is different; the WSDLs look
>> very similar to the Repository one.
>>
>> The strange thing is that for this error message it tries to access the
>> xsd through an https URL (which is perfectly accessible for the server).
>> Could it be that the connector restricts itself somehow to local file
>> system access only, or similar?
>> Have you faced this issue before?
>>
>>
>>
>> Am 16.01.2020 um 12:56 schrieb Karl Wright :
>>
>> 
>> I should say that we have (AFAICT) at least two independent installations
>> of the csws connector working in the field, at least one of them using
>> secure connections.
>>
>> Karl
>>
>>
>> On Thu, Jan 16, 2020 at 6:54 AM Karl Wright  wrote:
>>
>>> We solved the WSDL fetching through HTTPS, or thought we had, by
>>> restructuring the code according to a number of articles we found.  This
>>> was supposedly tested and worked in one installation.  Nobody has ever
>>> reported issues with the wsdls being fetched however; I worry that you may
>>> have a different version of OpenText that is incompatible with the one we
>>> developed against.  That's the problem with this kind of architecture;
>>> unless the wsdls are included in the jar there can be issues.  We tried to
>>> do that too but were unable to get it to work.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 16, 2020 at 5:34 AM Jörn Franke 
>>> wrote:
>>>
>>>> Ok i fixed the source to fetch WSDL from https (it is not perfect yet
>>>> as it does not use the truststore yet but this I can fix) - I will share
>>>> later in a Jira.
>>>> It is however now unable to locate the imported document
>>>> /Authentication?xsd=2 relative to Authenticaton?wsdl#types1
>>>>
>>>> I will look into this, but if someone has come cross it then please let
>>>> me know.
>>>>
>>>> Am 16.01.2020 um 10:22 schrieb Jörn Franke :
>>>>
>>>> 
>>>> Coming back to the original topic. I believe SSL was never fully solved
>>>> from what i read in the corresponding issue. Apparently, the fetching of
>>>> the WSDL itself through https was not possible. Do you remember still some
>>>> insights beyond what is written in the issue ?
>>>

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-16 Thread Karl Wright
The WSDLS are bundled with the jar.  We intended this to be the ONLY way
the wsdls were accessed, and made lots of changes to the wsdls accordingly,
so that they referenced other wsdls via the "file system".  The wsdls are
the fixed up ones that are used to build the java stubs locally, plus a
config file that's supposed to tell CXF how to resolve referenced wsdls.
That config file may or may not be correct, because we never were able to
get CXF to use the local resource wsdls during actual connection.

Except now they seem to be both fetched via https AND locally sourced.  I
have no idea how that can be.  I had assumed it was done one way or the
other but not both.

Perhaps the problem is that the configuration file is being read but the
resource wsdls are not being found?  Removing the meta-inf from the jar
would then force everything to go through https.  Ideally I'd love it if
that wasn't needed and we could get the resource fetch working everywhere.


Karl


On Thu, Jan 16, 2020 at 8:20 AM Jörn Franke  wrote:

> Well i am not sure how they solved it - I will share a tested solution in
> Jira and everyone can check. Maybe their wsdl is accessible through http?
>
> What works is doing calls through https, but the fetching of WSDL did not
> - as this is through another mechanism.
>
> I don’t think that the OpenText is different; the WSDL looks very similar
> to the repository.
>
> The strange thing is that for this error message it tries to access the
> xsd through a https url (which is perfectly accessible for the server).
> Could it be that the connector restricts itself somehow to local file
> system only or similar?
> Have you faced this issue before?

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-16 Thread Karl Wright
I should say that we have (AFAICT) at least two independent installations
of the csws connector working in the field, at least one of them using
secure connections.

Karl



Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-16 Thread Karl Wright
We solved the WSDL fetching through HTTPS, or thought we had, by
restructuring the code according to a number of articles we found.  This
was supposedly tested and worked in one installation.  Nobody has ever
reported issues with the wsdls being fetched however; I worry that you may
have a different version of OpenText that is incompatible with the one we
developed against.  That's the problem with this kind of architecture;
unless the wsdls are included in the jar there can be issues.  We tried to
do that too but were unable to get it to work.

Karl


On Thu, Jan 16, 2020 at 5:34 AM Jörn Franke  wrote:

> Ok i fixed the source to fetch WSDL from https (it is not perfect yet as
> it does not use the truststore yet but this I can fix) - I will share later
> in a Jira.
> It is however now unable to locate the imported document
> /Authentication?xsd=2 relative to Authenticaton?wsdl#types1
>
> I will look into this, but if someone has come across it then please let me
> know.
>
> Am 16.01.2020 um 10:22 schrieb Jörn Franke :
>
> 
> Coming back to the original topic. I believe SSL was never fully solved
> from what i read in the corresponding issue. Apparently, the fetching of
> the WSDL itself through https was not possible. Do you remember still some
> insights beyond what is written in the issue ?

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-15 Thread Karl Wright
Let me think about that option.

Karl


On Wed, Jan 15, 2020 at 5:38 PM Jörn Franke  wrote:

> We could make it configurable, e.g. in properties.xml. Here people could
> set it to SSL, TLS, or TLSv1.2 (to restrict it to TLSv1.2 => some companies
> may want that!). Is this a viable option? That would also be future-proof.
> We can leave the default at SSL, but we should put TLS in the example
> config files by default (so new starters do not even get the idea to use an
> outdated protocol) AND add a comment recommending to use/enforce the newest
> protocols for security reasons. Of course, the choice is then with the
> people using the software.
> Could that be something sensible from your point of view?
>
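The configurable-protocol idea quoted above could be sketched roughly as follows. This is illustrative only: the property name `org.apache.manifoldcf.ssl.protocol` is hypothetical, and real MCF code would read properties.xml and pass the per-connection key/trust material into `init` rather than `null`s.

```java
import javax.net.ssl.SSLContext;

public class ConfigurableSslContext {
    // Hypothetical property name, for illustration only.
    static final String PROTOCOL_PROPERTY = "org.apache.manifoldcf.ssl.protocol";

    // Build an SSLContext for the configured protocol, defaulting to "TLS"
    // so new installations negotiate the newest TLS the JDK offers; admins
    // could pin e.g. "TLSv1.2" to lock out older handshakes.
    public static SSLContext create() throws Exception {
        String protocol = System.getProperty(PROTOCOL_PROPERTY, "TLS");
        SSLContext ctx = SSLContext.getInstance(protocol);
        ctx.init(null, null, null); // real code would supply key/trust managers
        return ctx;
    }
}
```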

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-15 Thread Karl Wright
It's rather immaterial what browsers do here.  What's important is what
*existing servers* support, since that is what we're connecting with.

I tend to agree that *most* people have probably upgraded to web servers
that support TLS.  But we can't guarantee it, nor can we assume that people
have upgraded to the most modern version of TLS exclusively.  In fact I
think we can assume they have *not*.  When the SSL issues were discovered a
couple of years back, the standard recommendation was simply to *disable*
SSLv1 and SSLv2, not to upgrade to Java 11 or some such.  We still support
(and have people using!!) early forms of NTLM (v1 to be specific), for
instance.  We're not going to be able to wag the dog here.  Breaking
changes of this kind usually mean we go to a whole new major version of MCF.

However, if you can show that SSLContext.getInstance("TLS") produces a
SSLSocketFactory that works with all versions of TLS and SSL that do not
have known security holes, I would support changing over to that.  If it
turns out we need much more specificity about the kind of SSLSocketFactory
we produce, then we need a better solution anyhow for handling multiple
protocols in one socket factory.

Karl
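The question above — which protocol versions a context created with "TLS" actually enables — can be checked empirically with a small probe like the one below. This is an illustrative sketch, not MCF code; the exact lists depend on the JDK version and its `jdk.tls.disabledAlgorithms` security setting.

```java
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;

public class TlsProbe {
    // Return the protocol versions enabled by default for an SSLContext of
    // the given name ("TLS", "TLSv1.2", ...), as a client socket would use.
    public static String[] enabledProtocols(String contextName) throws Exception {
        SSLContext ctx = SSLContext.getInstance(contextName);
        ctx.init(null, null, null);
        // An unconnected socket is enough to inspect the defaults.
        try (SSLSocket s = (SSLSocket) ctx.getSocketFactory().createSocket()) {
            return s.getEnabledProtocols();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Enabled for \"TLS\": "
            + Arrays.toString(enabledProtocols("TLS")));
    }
}
```

On JDK 8 and later this typically shows TLSv1.2 (and, where available, TLSv1.3) enabled, with SSLv3 absent unless explicitly re-enabled.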


On Wed, Jan 15, 2020 at 5:17 AM Jörn Franke  wrote:

> Hi Karl,
>
> No it does not. I can look into that further, but current browsers stop
> supporting anything below TLSv1.2 in March 2020.
> TLS has existed for more than ten years; I expect any server running
> nowadays will have TLS support.
> SSL itself has not been supported for some time now. From a security
> perspective it should even break servers that run only SSL, as they are
> inherently insecure, and clients that only support SSL add to
> this.
> However, if you have an idea how this should be made configurable, then I
> can look into it.
>
> Best regards

Re: CSWS Connector : ServiceConstructionException: Failed to create service

2020-01-15 Thread Karl Wright
Hi,

MCF currently requires JDK 8.  JDK 11 is nontrivial to support because of
the removal of many JDK classes the connectors need.  It will be ported at
some point, but not lightly.

Similarly, disabling SSL would certainly break many installations upon
upgrade, and we do not do that lightly.

The core methods that MCF supplies its connectors should therefore be
updated to support, but not mandate, TLS.  The protocol specification one
gives to SSLContext is not a detailed one but rather a major version.  What
I don't know is whether "TLSv1" also allows for older protocols etc.

Karl

On Wed, Jan 15, 2020, 1:19 AM Jörn Franke  wrote:

> Yes I am doing that, but I will need to rebuild.
> I don’t recommend TLSv1 - this is already phased out and will lock out
> TLSv1.2. I will try just TLS, as it includes all TLS protocols (depends on
> the JDK).
>
> SSL will not be supported by this (however, as I said, there are other
> parts of the code with a getInstance("TLS")). And some caveats: on JDK 6+7,
> TLS only means TLSv1 (and newer TLS protocols are deactivated); on JDK 8 it
> means that newer TLS protocols are enabled as well.
> To be honest, in my opinion an SSL-only server is a significant security
> hole, and given how old TLS support in the JDK is, I would be surprised if
> someone were using such a server (most organisations should switch to
> TLSv1.2 in any case, as all protocols below it have been broken).
> While it works for all JDKs - probably JDK 8 should be recommended, as it
> seems to have all TLS protocols activated when using "TLS". Older JDKs seem
> to deactivate TLSv1.1 and TLSv1.2 when using TLS. I will write more about
> this in the JIRA, once I have verified that this solves the problem.
> Then, TLSv1.3 is JDK 11 only - I will investigate what that implies.
> Does ManifoldCF support JDK 11?
>
> Am 15.01.2020 um 00:08 schrieb Karl Wright :
>
> 
> I think you can just change the code to read as follows when it creates
> the SSLContext:
>
> SSLContext ctx = SSLContext.getInstance("TLSv1");
> I don't know if TLS will downgrade to SSL if that's all that's available.
>
>
> Karl
>
>
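Rather than guessing whether a `getInstance("TLSv1")` context will or won't fall back to older protocols, the negotiable versions can be pinned explicitly on each socket. A sketch (not MCF code; which version names are accepted depends on the JDK):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;

public class PinnedProtocols {
    // Create an unconnected client socket that will only offer the listed
    // protocol versions during the handshake, whatever the context defaults.
    public static SSLSocket newSocket(String... versions) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null);
        SSLSocket s = (SSLSocket) ctx.getSocketFactory().createSocket();
        s.setEnabledProtocols(versions); // e.g. {"TLSv1.2"} locks out SSLv3/TLSv1
        return s;
    }
}
```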
>
> On Tue, Jan 14, 2020 at 6:02 PM Jörn Franke  wrote:
>
>> Yes, if you do not change this setting - that is what I suspect happens
>> here. See my previous mail for details.
>>
>> Am 14.01.2020 um 23:51 schrieb Karl Wright :
>>
>> 
>> It looks like TLS is actually enabled in the SSLSocketFactory framework
>> based on how you create the SSLContext.  See:
>>
>> https://docs.oracle.com/cd/E19698-01/816-7609/security-83/index.html
>>
>> Karl
>>
>>
>> On Tue, Jan 14, 2020 at 5:48 PM Karl Wright  wrote:
>>
>>> The design of ManifoldCF deliberately manages keystores on a
>>> connection-by-connection basis, not globally.  If you think the only way
>>> to implement TLS is via a global keystore, I very much doubt that.
>>>
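The per-connection keystore design described above maps onto the standard JSSE pieces roughly as follows. A sketch only, assuming the connection's certificates have already been collected into a `KeyStore` object; MCF's actual helper classes differ.

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManagerFactory;

public class PerConnectionTrust {
    // Derive an SSLSocketFactory from one connection's own truststore,
    // rather than from the JVM-global cacerts file.
    public static SSLSocketFactory forTrustStore(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx.getSocketFactory();
    }
}
```

Because the factory comes from an `SSLContext` built per connection, two repository connections can trust entirely different certificate chains, which is the point of not using a global keystore.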
>>> I am on the road until late tomorrow but somewhere along the line I can
>>> do some research into why TLS won't work as we are currently doing it.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Jan 14, 2020 at 12:56 PM Jörn Franke 
>>> wrote:
>>>
>>>> These are TLS only. So maybe you have other servers where TLS and SSL
>>>> are both possible and it downgrades to SSL. However, this is speculation
>>>> and I need to verify it. I have to rebuild ManifoldCF for that. Probably
>>>> I have to reinstall everything, as the keystore factory is a dependency
>>>> in the connector.
>>>>
>>>> Am 14.01.2020 um 18:34 schrieb Karl Wright :
>>>>
>>>> 
>>>> If you can recommend changes to support TLS, that would be great.  The
>>>> basic infrastructure should still work; it is just a custom keystore and
>>>> associated SSLSocketFactory, which I think also is used for TLS
>>>> connections, unless I am missing something.
>>>>
>>>> On Tue, Jan 14, 2020, 9:38 AM Jörn Franke  wrote:
>>>>
>>>>> Yes this works fine. I believe the error comes from the fact that TLS
>>>>> connections are not supported.
>>>>>
>>>>> Am 14.01.2020 um 15:31 schrieb Michael Cizmar <
>>>>> michael.ciz...@mcplusa.com>:
>>>>>
>>>>> 
>>>>>
>>>>> If you want to test the url and the ssl, I would recommend using
>>>>> SSLPoke to confirm that the keystore is set up properly:
>>>>>
>>>>>
>>>>>
>>>>> https://github.com/MichalHecko/SSLPoke
>>>>>
>
