Re: Manifold CF-Non existent of URL
Yes, it's a continuous job.

On Tue, Sep 3, 2019 at 11:05 AM Priya Arora wrote:

> Hi,
>
> I have a job, myuniversity_intranet, which crawls data from an intranet site, and the data has been indexed into an index.
>
> My question is: does ManifoldCF have functionality to test, before indexing, whether a URL exists? For example, my index (say index name: abc) contains the indexed URL https:myuniversity/reaserch/info (an intranet URL). This URL existed earlier but no longer does, and it now returns status 404.
>
> Can ManifoldCF check before indexing that the status is not 404 (that is, the URL exists), index the URL only if it really exists, and skip it otherwise? Can this be set up while configuring the ManifoldCF job, or do I have to handle it manually in code?
>
> Kind regards,
> Priya
>
> On Mon, Sep 2, 2019 at 8:19 PM Karl Wright wrote:
>
>> Hi,
>>
>> You aren't giving me enough information to know why your job isn't rechecking URLs. Please tell me how your job is configured, specifically whether it's continuous or not. Thanks.
>>
>> Karl
>>
>> On Mon, Sep 2, 2019 at 4:47 AM Priya Arora wrote:
>>
>> > Hi,
>> >
>> > I have a question about ManifoldCF. Does it have functionality to check whether a URL it is crawling actually exists, or returns page not found (404)?
>> >
>> > I am crawling data for a university, and the job is running continuously. After some time I found that certain URLs have been removed from the university site but are still being indexed.
>> >
>> > Some pages now return status 404. How can ManifoldCF be made to check this automatically, so that a URL that returns 404 (no longer exists) is not indexed?
>> >
>> > Thanks,
>> > Priya
Re: Manifold CF-Non existent of URL
Hi,

I have a job, myuniversity_intranet, which crawls data from an intranet site, and the data has been indexed into an index.

My question is: does ManifoldCF have functionality to test, before indexing, whether a URL exists? For example, my index (say index name: abc) contains the indexed URL https:myuniversity/reaserch/info (an intranet URL). This URL existed earlier but no longer does, and it now returns status 404.

Can ManifoldCF check before indexing that the status is not 404 (that is, the URL exists), index the URL only if it really exists, and skip it otherwise? Can this be set up while configuring the ManifoldCF job, or do I have to handle it manually in code?

Kind regards,
Priya

On Mon, Sep 2, 2019 at 8:19 PM Karl Wright wrote:

> Hi,
>
> You aren't giving me enough information to know why your job isn't rechecking URLs. Please tell me how your job is configured, specifically whether it's continuous or not. Thanks.
>
> Karl
>
> On Mon, Sep 2, 2019 at 4:47 AM Priya Arora wrote:
>
> > Hi,
> >
> > I have a question about ManifoldCF. Does it have functionality to check whether a URL it is crawling actually exists, or returns page not found (404)?
> >
> > I am crawling data for a university, and the job is running continuously. After some time I found that certain URLs have been removed from the university site but are still being indexed.
> >
> > Some pages now return status 404. How can ManifoldCF be made to check this automatically, so that a URL that returns 404 (no longer exists) is not indexed?
> >
> > Thanks,
> > Priya
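For the "handle this manually in code" option raised in this message, a minimal standalone sketch follows. It is not a ManifoldCF feature or API: the class name UrlExistenceCheck, the methods shouldIndex and fetchStatus, the 5-second timeouts, and the policy of indexing only 2xx responses are all assumptions made for illustration. It uses only the standard java.net.HttpURLConnection class to send a HEAD request and read the status code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlExistenceCheck {

    /**
     * Decides from an HTTP status code whether a page should be indexed.
     * Assumed policy: index only 2xx responses, so 404 is skipped.
     */
    static boolean shouldIndex(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Sends a HEAD request to the address and returns the HTTP status code. */
    static int fetchStatus(String address) throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) new URL(address).openConnection();
        try {
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000); // assumed timeout, in milliseconds
            connection.setReadTimeout(5000);    // assumed timeout, in milliseconds
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as a URL to check.
        for (String address : args) {
            int status = fetchStatus(address);
            String decision = shouldIndex(status) ? "index" : "skip";
            System.out.println(address + " -> " + status + " (" + decision + ")");
        }
    }
}
```

Keeping the decision in shouldIndex separate from the network call makes the policy easy to change, for example if redirects or authentication responses need different handling on the intranet.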
[jira] [Commented] (CONNECTORS-1566) Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921084#comment-16921084 ]

Karl Wright commented on CONNECTORS-1566:
-

The only thing that is preventing this from going live now is the ability to get metadata for documents. Specifically, I need the following method:

{code}
String[] getAttributeValues(long docID, long catID);
{code}

[~schuch], the engineer I'm working with elsewhere seems to be incapable of discovering how this is done. Do you have anyone where you work who may be able to find the answer?

> Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector
> --
>
>                 Key: CONNECTORS-1566
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1566
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: LiveLink connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.14
>
>         Attachments: OTCS_IIS.png, OTCS_Tomcat.png, chrome_cgfC00ujx7.png
>
>
> LAPI is being deprecated. We need to develop a replacement for it using the
> ContentServer Web Services API.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (CONNECTORS-1508) Add support for French Language
[ https://issues.apache.org/jira/browse/CONNECTORS-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1508:
    Fix Version/s:     (was: ManifoldCF 2.14)
                       ManifoldCF 2.15

> Add support for French Language
> ---
>
>                 Key: CONNECTORS-1508
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1508
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Cedric Ulmer
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.15
>
>         Attachments: cedricmanifold_fr.zip
>
>
> Some users may need a French version of the resource bundle. I attached a
> preliminary translation that France Labs made some time ago (probably around
> summer 2016), but that we halted due to lack of time (and priority). It is
> probably almost complete, but some quality checking needs to be done. Note
> also that I forgot to check the version when I did the translations, so
> anyone interested would need to check any modifications that may have
> occurred between this version and the current MCF version.
[jira] [Updated] (CONNECTORS-1521) Documentum Connector users ManifoldCF's local time in queries constraints against the Documentum server without reference to time zones
[ https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1521:
    Fix Version/s:     (was: ManifoldCF 2.14)
                       ManifoldCF 2.15

> Documentum Connector users ManifoldCF's local time in queries constraints
> against the Documentum server without reference to time zones
> ---
>
>                 Key: CONNECTORS-1521
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Documentum connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: James Thomas
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.15
>
>
> I find that the time/date constraints in queries to the Documentum server are
> based on the "raw" local time of the ManifoldCF server but appear to take no
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output
> repository when you would naturally expect them to be. I'd like the times to
> be aligned, perhaps by including time zone in the query. In particular, is
> there a way to use UTC perhaps?
> Here's an example ...
> * create a folder in Documentum
> * set up a job to point at the folder and output to the file system
> * put two documents into a folder in Documentum
> * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug 7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute
> query= (select for READ distinct i_chronicle_id from dm_document where
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss')
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of
> the files added to DM. (And is an hour out, see footnote.)
>
> See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
> * change the timezone on the ManifoldCF server machine
> * restart the ManifoldCF server and the Documentum processes
> * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug 7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus
> the few minutes it took to change the date etc (and is still an hour out, see
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss')
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>
>
> [Footnote] It appears that something is trying to take account of Daylight
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
> Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss')
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
> Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/ hh:mi:ss')
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]
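The one-hour gap described in the footnote can be reproduced outside ManifoldCF. The sketch below is not the connector's code and makes no claim about where the shift actually happens; it only shows that formatting an instant with a zone's standard offset, rather than its full daylight-saving rules, yields the same wall-clock values the report shows. The class name DstOffsetSketch, the methods withDst and standardOnly, the formatter pattern, and the choice of Europe/Paris as a CET/CEST zone are all assumptions for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DstOffsetSketch {

    // Hypothetical pattern chosen for this illustration; not taken from the connector.
    private static final DateTimeFormatter WALL_CLOCK =
        DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    /** Wall-clock text using the zone's full rules, including daylight saving. */
    static String withDst(Instant instant, ZoneId zone) {
        return WALL_CLOCK.withZone(zone).format(instant);
    }

    /** Wall-clock text using only the zone's standard offset, ignoring daylight saving. */
    static String standardOnly(Instant instant, ZoneId zone) {
        ZoneOffset standard = zone.getRules().getStandardOffset(instant);
        return WALL_CLOCK.withZone(standard).format(instant);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Paris"); // assumption: any CET/CEST zone would do

        // Sat Oct 27 00:00:26 CEST 2018, inside daylight saving time
        Instant insideDst = Instant.parse("2018-10-26T22:00:26Z");
        // Mon Oct 29 00:01:39 CET 2018, outside daylight saving time
        Instant outsideDst = Instant.parse("2018-10-28T23:01:39Z");

        System.out.println(withDst(insideDst, zone));       // 10/27/2018 00:00:26
        System.out.println(standardOnly(insideDst, zone));  // 10/26/2018 23:00:26
        System.out.println(withDst(outsideDst, zone));      // 10/29/2018 00:01:39
        System.out.println(standardOnly(outsideDst, zone)); // 10/29/2018 00:01:39
    }
}
```

Inside DST the two renderings differ by an hour, and outside DST they agree, which matches the pattern in the footnote. Formatting query bounds in UTC, as the report suggests, would avoid depending on either server's local zone, provided the Documentum side interprets the value as UTC as well.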
[jira] [Updated] (CONNECTORS-1622) Upgrade to Tika 1.22
[ https://issues.apache.org/jira/browse/CONNECTORS-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cihad Guzel updated CONNECTORS-1622:
    Summary: Upgrade to Tika 1.22  (was: Upgrade to Tika 1.22 when available)

> Upgrade to Tika 1.22
>
>
>                 Key: CONNECTORS-1622
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1622
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Cihad Guzel
>            Priority: Major
>             Fix For: ManifoldCF next
>
>
> Tika has released 1.22. Changes can be found here:
> http://tika.apache.org/1.22/
[jira] [Created] (CONNECTORS-1622) Upgrade to Tika 1.22 when available
Cihad Guzel created CONNECTORS-1622:
---

             Summary: Upgrade to Tika 1.22 when available
                 Key: CONNECTORS-1622
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1622
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.13
            Reporter: Cihad Guzel
             Fix For: ManifoldCF next


Tika has released 1.22. Changes can be found here:
http://tika.apache.org/1.22/
Re: Manifold CF-Non existent of URL
Hi,

You aren't giving me enough information to know why your job isn't rechecking URLs. Please tell me how your job is configured, specifically whether it's continuous or not. Thanks.

Karl

On Mon, Sep 2, 2019 at 4:47 AM Priya Arora wrote:

> Hi,
>
> I have a question about ManifoldCF. Does it have functionality to check whether a URL it is crawling actually exists, or returns page not found (404)?
>
> I am crawling data for a university, and the job is running continuously. After some time I found that certain URLs have been removed from the university site but are still being indexed.
>
> Some pages now return status 404. How can ManifoldCF be made to check this automatically, so that a URL that returns 404 (no longer exists) is not indexed?
>
> Thanks,
> Priya
Manifold CF-Non existent of URL
Hi,

I have a question about ManifoldCF. Does it have functionality to check whether a URL it is crawling actually exists, or returns page not found (404)?

I am crawling data for a university, and the job is running continuously. After some time I found that certain URLs have been removed from the university site but are still being indexed.

Some pages now return status 404. How can ManifoldCF be made to check this automatically, so that a URL that returns 404 (no longer exists) is not indexed?

Thanks,
Priya