Re: Manifold CF-Non existent of URL

2019-09-02 Thread Priya Arora
Yes, it's a continuous job.

On Tue, Sep 3, 2019 at 11:05 AM Priya Arora  wrote:

> Hi,
> I have a job, myuniversity_intranet, which crawls data from an intranet
> site; the data has been indexed into an index.
> My question is: does ManifoldCF have functionality to test, before
> indexing, whether a URL exists?
> For example, my index (say, index name: abc) contains this indexed URL:
> https:myuniversity/reaserch/info (an intranet URL). This URL existed
> earlier but no longer exists, and its status is now 404.
>
> Question: can ManifoldCF check, before indexing, that the status is not
> 404 (meaning the URL exists), and index the URL only if it really exists,
> otherwise skipping it?
> Can this be set up while configuring the ManifoldCF job, or do I have to
> handle it manually in code?
>
>
> Kind regards
> Priya
>
> On Mon, Sep 2, 2019 at 8:19 PM Karl Wright  wrote:
>
>> Hi,
>> You aren't giving me enough information to know why your job isn't
>> rechecking URLs.  Please tell me how your job is configured, specifically
>> whether it's continuous or not.  Thanks.
>>
>> Karl
>>
>>
>> On Mon, Sep 2, 2019 at 4:47 AM Priya Arora  wrote:
>>
>> > Hi,
>> >
>> > I have a question about ManifoldCF. Does it have functionality to check
>> > whether a URL it is crawling actually exists or returns page not found
>> > (404)?
>> >
>> > I have a requirement in which I am crawling data for a university, and
>> > the job is running continuously. After some time, we found that certain
>> > URLs have been removed from the university site but are still being
>> > indexed.
>> >
>> > Some pages have been marked with status 404.
>> > How can ManifoldCF be set up to check this automatically, so that a URL
>> > that returns 404 (no longer exists) is not indexed?
>> >
>> > Thanks
>> > Priya.
>> >
>>
>


Re: Manifold CF-Non existent of URL

2019-09-02 Thread Priya Arora
Hi,
I have a job, myuniversity_intranet, which crawls data from an intranet
site; the data has been indexed into an index.
My question is: does ManifoldCF have functionality to test, before
indexing, whether a URL exists?
For example, my index (say, index name: abc) contains this indexed URL:
https:myuniversity/reaserch/info (an intranet URL). This URL existed
earlier but no longer exists, and its status is now 404.

Question: can ManifoldCF check, before indexing, that the status is not
404 (meaning the URL exists), and index the URL only if it really exists,
otherwise skipping it?
Can this be set up while configuring the ManifoldCF job, or do I have to
handle it manually in code?


Kind regards
Priya

On Mon, Sep 2, 2019 at 8:19 PM Karl Wright  wrote:

> Hi,
> You aren't giving me enough information to know why your job isn't
> rechecking URLs.  Please tell me how your job is configured, specifically
> whether it's continuous or not.  Thanks.
>
> Karl
>
>
> On Mon, Sep 2, 2019 at 4:47 AM Priya Arora  wrote:
>
> > Hi,
> >
> > I have a question about ManifoldCF. Does it have functionality to check
> > whether a URL it is crawling actually exists or returns page not found
> > (404)?
> >
> > I have a requirement in which I am crawling data for a university, and
> > the job is running continuously. After some time, we found that certain
> > URLs have been removed from the university site but are still being
> > indexed.
> >
> > Some pages have been marked with status 404.
> > How can ManifoldCF be set up to check this automatically, so that a URL
> > that returns 404 (no longer exists) is not indexed?
> >
> > Thanks
> > Priya.
> >
>
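
The check asked about above (index a URL only if its status is not 404) can
be sketched as a standalone Java helper. This is not ManifoldCF configuration
or ManifoldCF API, and it does not settle whether a job setting exists for
this. The class name UrlExistenceCheck and the methods shouldIndex and
fetchStatus are hypothetical names made up for illustration, and the sketch
assumes the intranet server answers HEAD requests.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class UrlExistenceCheck {

    /** Decision rule from the message: index unless the status is 404. */
    public static boolean shouldIndex(int statusCode) {
        return statusCode != HttpURLConnection.HTTP_NOT_FOUND;
    }

    /** Sends a HEAD request and returns the HTTP status code (hypothetical helper). */
    public static int fetchStatus(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");   // status only, no response body
        conn.setConnectTimeout(5000);    // milliseconds
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as one URL to test.
        for (String url : args) {
            int status = fetchStatus(url);
            System.out.println(url + " -> " + status
                    + (shouldIndex(status) ? " (index)" : " (skip)"));
        }
    }
}
```

Treating every status other than 404 as "exists" follows the rule stated in
the message; a stricter rule might also skip other error statuses.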


[jira] [Commented] (CONNECTORS-1566) Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector

2019-09-02 Thread Karl Wright (Jira)


    [ https://issues.apache.org/jira/browse/CONNECTORS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921084#comment-16921084 ]

Karl Wright commented on CONNECTORS-1566:
-----------------------------------------

The only thing that is preventing this from going live now is the ability to 
get metadata for documents.  Specifically, I need the following method:

{code}
String[] getAttributeValues(long docID, long catID);
{code}

[~schuch], the engineer I'm working with elsewhere seems to be incapable of 
discovering how this is done.  Do you have anyone where you work who may be 
able to find the answer?



> Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector
> ------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1566
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1566
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: LiveLink connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.14
>
>         Attachments: OTCS_IIS.png, OTCS_Tomcat.png, chrome_cgfC00ujx7.png
>
>
> LAPI is being deprecated.  We need to develop a replacement for it using the 
> ContentServer Web Services API.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (CONNECTORS-1508) Add support for French Language

2019-09-02 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1508:

    Fix Version/s:     (was: ManifoldCF 2.14)
                       ManifoldCF 2.15

> Add support for French Language
> -------------------------------
>
>                 Key: CONNECTORS-1508
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1508
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Cedric Ulmer
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.15
>
>         Attachments: cedricmanifold_fr.zip
>
>
> Some users may need a French version of the resource bundle. I attached a 
> preliminary translation that France Labs made some time ago (probably around 
> summer 2016), but that we halted due to lack of time (and priority). It is 
> probably almost complete, but some quality checking needs to be done. Note 
> also that I forgot to check the version when I did the translations, so 
> anyone interested would need to check any modifications that may have 
> occurred between this version and the current MCF version.





[jira] [Updated] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2019-09-02 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1521:

    Fix Version/s:     (was: ManifoldCF 2.14)
                       ManifoldCF 2.15

> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1521
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Documentum connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: James Thomas
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.15
>
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/yyyy hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/yyyy hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]
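
The two footnote observations above are consistent with a timestamp being
formatted with the zone's standard (non-DST) offset instead of its full zone
rules. The Java sketch below reproduces those two query bounds under that
assumption. It is not the connector's code and does not establish where the
shift actually happens; the class name QueryDateZones and both method names
are hypothetical, and Europe/Paris is assumed as the zone behind the CET/CEST
output shown by date.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class QueryDateZones {

    // Month/day/year with 24-hour time, chosen to match the date literals
    // in the logged queries, e.g. '10/26/2018 23:00:26'.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    /** Formats the instant using the zone's full rules, DST included. */
    public static String withZoneRules(Instant t, ZoneId zone) {
        return FMT.format(t.atZone(zone));
    }

    /** Formats the instant using only the zone's standard offset, ignoring DST. */
    public static String withStandardOffsetOnly(Instant t, ZoneId zone) {
        ZoneOffset standard = zone.getRules().getStandardOffset(t);
        return FMT.format(t.atOffset(standard));
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Paris"); // assumed zone for CET/CEST

        // Sat Oct 27 00:00:26 CEST 2018 (inside DST)
        Instant insideDst = Instant.parse("2018-10-26T22:00:26Z");
        // Mon Oct 29 00:01:39 CET 2018 (outside DST)
        Instant outsideDst = Instant.parse("2018-10-28T23:01:39Z");

        for (Instant t : new Instant[] {insideDst, outsideDst}) {
            System.out.println("zone rules:      " + withZoneRules(t, zone));
            System.out.println("standard offset: " + withStandardOffsetOnly(t, zone));
        }
    }
}
```

Formatting the bound in UTC on both sides, for example with
FMT.format(t.atOffset(ZoneOffset.UTC)), would be one way to make the query
independent of either server's zone, as the description suggests.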





[jira] [Updated] (CONNECTORS-1622) Upgrade to Tika 1.22

2019-09-02 Thread Cihad Guzel (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cihad Guzel updated CONNECTORS-1622:

Summary: Upgrade to Tika 1.22  (was: Upgrade to Tika 1.22 when available)

> Upgrade to Tika 1.22
> --------------------
>
>                 Key: CONNECTORS-1622
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1622
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Cihad Guzel
>            Priority: Major
>             Fix For: ManifoldCF next
>
>
> Tika 1.22 has been released. The changes are listed here: 
> http://tika.apache.org/1.22/





[jira] [Created] (CONNECTORS-1622) Upgrade to Tika 1.22 when available

2019-09-02 Thread Cihad Guzel (Jira)
Cihad Guzel created CONNECTORS-1622:
------------------------------------

             Summary: Upgrade to Tika 1.22 when available
                 Key: CONNECTORS-1622
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1622
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.13
            Reporter: Cihad Guzel
             Fix For: ManifoldCF next


Tika 1.22 has been released. The changes are listed here: 
http://tika.apache.org/1.22/





Re: Manifold CF-Non existent of URL

2019-09-02 Thread Karl Wright
Hi,
You aren't giving me enough information to know why your job isn't
rechecking URLs.  Please tell me how your job is configured, specifically
whether it's continuous or not.  Thanks.

Karl


On Mon, Sep 2, 2019 at 4:47 AM Priya Arora  wrote:

> Hi,
>
> I have a question about ManifoldCF. Does it have functionality to check
> whether a URL it is crawling actually exists or returns page not found
> (404)?
>
> I have a requirement in which I am crawling data for a university, and
> the job is running continuously. After some time, we found that certain
> URLs have been removed from the university site but are still being
> indexed.
>
> Some pages have been marked with status 404.
> How can ManifoldCF be set up to check this automatically, so that a URL
> that returns 404 (no longer exists) is not indexed?
>
> Thanks
> Priya.
>


Manifold CF-Non existent of URL

2019-09-02 Thread Priya Arora
Hi,

I have a question about ManifoldCF. Does it have functionality to check
whether a URL it is crawling actually exists or returns page not found
(404)?

I have a requirement in which I am crawling data for a university, and
the job is running continuously. After some time, we found that certain
URLs have been removed from the university site but are still being
indexed.

Some pages have been marked with status 404.
How can ManifoldCF be set up to check this automatically, so that a URL
that returns 404 (no longer exists) is not indexed?

Thanks
Priya.