[jira] [Commented] (CONNECTORS-1588) Custom Jcifs Properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780473#comment-16780473 ] Karl Wright commented on CONNECTORS-1588: - Patch looks fine. I'll commit it. > Custom Jcifs Properties > --- > > Key: CONNECTORS-1588 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1588 > Project: ManifoldCF > Issue Type: Improvement > Components: JCIFS connector >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: CONNECTORS-1588 > > > In some cases, "jcifs" is running slowly. In order to solve this problem, we > need to set custom some properties. > > For example; my problem was in my test environment: I have a windows server > and an ubuntu server in same network in AWS EC2 Service. The windows server > has Active Directory service, DNS Server and shared folder while the ubuntu > server has some instance such as manifoldcf, an db instance and solr. > > If the DNS settings are not defined on the ubuntu server, jcifs runs slowly. > Because the default resolver order is set as 'LMHOSTS,DNS,WINS'. It means[1] > ; firstly "jcifs" checks '/etc/hosts' files for linux/unix server'', then it > checks the DNS server. In my opinion, the linux server doesn't recognize the > DNS server and threads are waiting for every file for access to read. > > I suppose, WINS is used when accessing hosts on different subnets. So, I > have set "jcifs.resolveOrder = WINS" and my problem has been FIXED. > > Another suggestion for similar problem from [another > example|https://stackoverflow.com/a/18837754] : "-Djcifs.resolveOrder = DNS" > > We need to set custom resolveOrder variable. > ^[1]^ [https://www.jcifs.org/src/docs/resolver.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780373#comment-16780373 ] Karl Wright commented on CONNECTORS-1563: - Hi [~Subasini], The "excluded mime types" that you set are meant to exclude documents *entirely*, so changing that setting has no effect on *how* documents are indexed. You can look at the Simple History report to verify that this is taking place as you desire, because most connectors create a record when they reject a document for any reason. The Web Connector is no exception. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778989#comment-16778989 ] Karl Wright commented on CONNECTORS-1587: - It is simple; the crawler is requesting more metadata columns at one time than your Sharepoint instance is allowed to respond to. This is a SharePoint configuration issue, apparently, although it is one I've never heard of before. It's certainly *not* a SharePoint Connector issue, unless there's some hard-wired Microsoft limit that you are up against. > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1587. - Resolution: Invalid Not a ManifoldCF bug > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778159#comment-16778159 ] Karl Wright commented on CONNECTORS-1564: - [~erlendfg], if ModifiedHttpSolrClient overrides this setting already, then I don't understand why it isn't working, unless the override isn't setting it. Is that the case? If so, then the obvious fix is to just set it there. > Support preemptive authentication to Solr connector > --- > > Key: CONNECTORS-1564 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1564 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Reporter: Erlend Garåsen > Assignee: Karl Wright >Priority: Major > Attachments: CONNECTORS-1564.patch > > > We should post preemptively in case the Solr server requires basic > authentication. This will make the communication between ManifoldCF and Solr > much more effective instead of the following: > * Send a HTTP POST request to Solr > * Solr sends a 401 response > * Send the same request, but with a "{{Authorization: Basic}}" header > With preemptive authentication, we can send the header in the first request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
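The three-step round trip described in the issue (POST, 401 challenge, retry with credentials) collapses to a single request when the Authorization header is attached preemptively. A minimal, dependency-free sketch of constructing that header follows; the class name is illustrative, and the user/password pair in the usage note is the standard RFC 7617 example, not a real credential:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PreemptiveBasicAuth
{
  // Preemptive basic auth simply means sending this header on the very
  // first request, instead of waiting for the server's 401 challenge.
  // The value is "Basic " followed by base64(user:password).
  static String basicAuthHeader(String user, String password)
  {
    String token = Base64.getEncoder().encodeToString(
        (user + ":" + password).getBytes(StandardCharsets.UTF_8));
    return "Basic " + token;
  }
}
```

For example, `basicAuthHeader("Aladdin", "open sesame")` yields the RFC 7617 value `Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`. With Apache HttpClient 4.x, the same effect is typically achieved by priming an `AuthCache` with a `BasicScheme` for the target host so the client emits the header on the first request.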
Re: custom jcifs properties
Hi Cihad, I am usually not fond of solutions where all connections of a given type are affected by a single environment change. In this case, there is no good way to make the change connection-specific. I would insist, though, that any change to the functioning of the system be backwards-compatible. That means that even if there are no changes to start-options.env, the default behavior is still as it was. I can think of a way to do that: basically, in the static block, check whether each property is already set, and only set the default if it isn't. Karl On Mon, Feb 25, 2019 at 2:49 PM Cihad Guzel wrote: > Hi Karl, > > In some cases, "jcifs" is running slowly. In order to solve this problem, > we need to set custom some properties. > > For example; my problem was in my test environment: I have a windows > server and an ubuntu server in same network in AWS EC2 Service. The windows > server has Active Directory service, DNS Server and shared folder while the > ubuntu server has some instance such as manifoldcf, an db instance and > solr. > > If the DNS settings are not defined on the ubuntu server, jcifs runs > slowly. Because the default resolver order is set as 'LMHOSTS,DNS,WINS'. It > means[1]; firstly "jcifs" checks '/etc/hosts' files for linux/unix > server'', then it checks the DNS server. In my opinion, the linux server > doesn't recognize the DNS server and threads are waiting for every file for > access to read. > > I suppose, WINS is used when accessing hosts on different subnets. So, I > have set "jcifs.resolveOrder = WINS" and my problem has been FIXED.
> > Another suggestion for similar problem from another example[2]: > "-Djcifs.resolveOrder = DNS" > > Finally; I suggest these changes: > > Remove the line > (System.setProperty("jcifs.resolveOrder","LMHOSTS,DNS,WINS"); ) from > SharedDriveConnector.java > > Add "-Djcifs.resolveOrder = LMHOSTS,DNS,WINS" to "start-options.env" file. > > If you have been convinced about this, I can create a PR. > > [1] https://www.jcifs.org/src/docs/resolver.html > [2] https://stackoverflow.com/a/18837754 > > Regards, > Cihad Guzel > > On Sun, Feb 24, 2019 at 7:20 PM Karl Wright > wrote: > >> These settings were provided by the developer of jcifs, Michael Allen. >> You have to really understand the protocol well before you should consider >> changing them in any way. >> >> Thanks, >> Karl >> >> >> On Sun, Feb 24, 2019 at 9:53 AM Cihad Guzel wrote: >> >>> Hi, >>> >>> SharedDriveConnector have some hardcoded system properties as follow: >>> >>> static >>> { >>> System.setProperty("jcifs.smb.client.soTimeout","15"); >>> System.setProperty("jcifs.smb.client.responseTimeout","12"); >>> System.setProperty("jcifs.resolveOrder","LMHOSTS,DNS,WINS"); >>> System.setProperty("jcifs.smb.client.listCount","20"); >>> System.setProperty("jcifs.smb.client.dfs.strictView","true"); >>> } >>> >>> How can I override them when to start manifoldcf? >>> >>> It may be better to define these settings in the start-options.env file. >>> >>> Regards, >>> Cihad Guzel >>> >>
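The backwards-compatible approach discussed in this thread — keep the hard-coded defaults, but let a `-D` option in start-options.env win — can be sketched as follows. The class and helper names here are hypothetical, not actual ManifoldCF code, and only one of the quoted properties is shown:

```java
// Hypothetical sketch: apply a hard-coded jcifs default only when the
// property was not already supplied externally (e.g. via a -D option in
// start-options.env). Class and method names are illustrative.
public final class JcifsDefaults
{
  static void setIfUnset(String key, String value)
  {
    // Leave any externally supplied value untouched.
    if (System.getProperty(key) == null)
      System.setProperty(key, value);
  }

  static
  {
    // Default resolver order, overridable with -Djcifs.resolveOrder=...
    setIfUnset("jcifs.resolveOrder", "LMHOSTS,DNS,WINS");
  }
}
```

With this pattern, starting the JVM with `-Djcifs.resolveOrder=WINS` keeps that value, while an unmodified start-options.env leaves the default behavior exactly as it was.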
Re: custom jcifs properties
These settings were provided by the developer of jcifs, Michael Allen. You have to really understand the protocol well before you should consider changing them in any way. Thanks, Karl On Sun, Feb 24, 2019 at 9:53 AM Cihad Guzel wrote: > Hi, > > SharedDriveConnector have some hardcoded system properties as follow: > > static > { > System.setProperty("jcifs.smb.client.soTimeout","15"); > System.setProperty("jcifs.smb.client.responseTimeout","12"); > System.setProperty("jcifs.resolveOrder","LMHOSTS,DNS,WINS"); > System.setProperty("jcifs.smb.client.listCount","20"); > System.setProperty("jcifs.smb.client.dfs.strictView","true"); > } > > How can I override them when to start manifoldcf? > > It may be better to define these settings in the start-options.env file. > > Regards, > Cihad Guzel >
[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775088#comment-16775088 ] Karl Wright commented on CONNECTORS-1587: - Can you amend your ticket to tell us what connectors you are using for your job? This ticket is very nearly incomprehensible, and unless it is amended I will close it on that basis. > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: ManifoldCF Website Links
We haven't been updating "latest" since at least 2016. We now only include actual releases. Honestly, how do you tell Google to reindex?? On Fri, Feb 22, 2019 at 3:59 AM Furkan KAMACI wrote: > How about adding a path as latest, i.e.: > > https://manifoldcf.apache.org/release/latest/en_US/performance-tuning.html > > > On Fri, Feb 22, 2019 at 11:34 AM Karl Wright > wrote: > > > Hi Furkan, > > > > I am not sure why Google maintains these dead links but we simply cannot > > publish doc for every release going back to 2012. Generally we cycle > > releases and include the last two for each major release. We include the > > 1.10 docs as well as the 2.12 and 2.11 docs right now. It is > prohibitively > > expensive to include more than that; doing so would make it incrementally > > harder to update the site, and it's already not easy. > > > > Thanks, > > Karl > > > > > > On Fri, Feb 22, 2019 at 2:34 AM Furkan KAMACI > > wrote: > > > > > Hi All, > > > > > > When we search something on ManifoldCF on Google we get results > something > > > like that: > > > > > > > > > > > > https://manifoldcf.apache.org/release/release-2.9.1/en_US/performance-tuning.html > > > > > > However, such links are broken. Can we fix it someway i.e. creating a > > path > > > for latest release? > > > > > > Kind Regards, > > > Furkan KAMACI > > > > > >
Re: ManifoldCF Website Links
Hi Furkan, I am not sure why Google maintains these dead links but we simply cannot publish doc for every release going back to 2012. Generally we cycle releases and include the last two for each major release. We include the 1.10 docs as well as the 2.12 and 2.11 docs right now. It is prohibitively expensive to include more than that; doing so would make it incrementally harder to update the site, and it's already not easy. Thanks, Karl On Fri, Feb 22, 2019 at 2:34 AM Furkan KAMACI wrote: > Hi All, > > When we search something on ManifoldCF on Google we get results something > like that: > > > https://manifoldcf.apache.org/release/release-2.9.1/en_US/performance-tuning.html > > However, such links are broken. Can we fix it someway i.e. creating a path > for latest release? > > Kind Regards, > Furkan KAMACI >
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774309#comment-16774309 ] Karl Wright commented on CONNECTORS-1584: - Have you subscribed to the list? Instructions are in the documentation for "contact us". You send mail to: user-subscr...@manifoldcf.apache.org > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What type of regexs does manifold include and exclude support and also in > general regex support? > At the moment i'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude urls that link to documents. > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF > The issue i'm having is that the regex that I have found so far doesn't work > case insensitive, so for every possible case i have to add a new line. > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation what type of regex is able to be used or > maybe a tool to test your regex and see if it is supported by manifold ? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > adress returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774105#comment-16774105 ] Karl Wright commented on CONNECTORS-1584: - Actually, it *is* user@ but so many people get mixed up with that that I got it backwards myself. What failure notice did you get when you mailed to user@? I receive email from this list a dozen times a day or more so I am not sure why you'd be having trouble. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What type of regexs does manifold include and exclude support and also in > general regex support? > At the moment i'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude urls that link to documents. > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF > The issue i'm having is that the regex that I have found so far doesn't work > case insensitive, so for every possible case i have to add a new line. > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation what type of regex is able to be used or > maybe a tool to test your regex and see if it is supported by manifold ? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > adress returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772995#comment-16772995 ] Karl Wright commented on CONNECTORS-1563: - [~Subasini], we are trying to debug your setup. The first principle of debugging is to identify where exactly the problem is occurring. It eliminates one variable. The file system connector is quite simple and has few configuration options, so it should be easy to set something up we can use to evaluate your solr connection. Thanks, Karl > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772912#comment-16772912 ] Karl Wright commented on CONNECTORS-1563: - [~Subasini], the "error" is because it does not recognize a specific translation bundle for your language, so it defaults to English. It is harmless. I asked you to *try* working with a File System connection initially to narrow down where your problems were coming from. Please do so. [~shinichiro abe] and I both tried a configuration similar to the one you report at the end of last year, when we were debugging the 2.11 release of ManifoldCF. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772729#comment-16772729 ] Karl Wright commented on CONNECTORS-1563: - In general in cases like this I recommend that people start with the simplest possible working configuration and then modify it until they achieve their goals. In this case that would mean starting with a file system job and a freshly-installed Solr instance, with no other changes whatsoever. [~shinichiro abe], can you help Mr. Rath by trying MCF 2.12 with a fresh single-process Solr instance, using the "/update" handler? He claims that this does not work and I do not have any time to work with him for the next few weeks. If it works for you please provide detailed steps describing what you did. Thanks in advance! > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771663#comment-16771663 ] Karl Wright commented on CONNECTORS-1563: - Hi Subasini, Are you now Tika-extracting in ManifoldCF, or in Solr? The text field looks like it contains properly extracted content, along with other stuff you do not want. Is this correct? If the extraction is happening in Solr, then I have no idea where this is coming from. If the extraction is happening in ManifoldCF, then if you have placed a Metadata Adjuster transformer in the pipeline between the Tika Extractor and the Solr Output Connector, I'd say you had set it up to concatenate many fields together into a text field. The Metadata Adjuster has that ability. The choice of how metadata (or content) fields get mapped to Solr schema is set up in your Solr output connection configuration. The Tika extraction basically replaces a binary input document with a character-sequence output document plus metadata fields. The character-sequence output document then must be sent to Solr not using the extracting update handler, but just the standard handler, so the handler should be changed from /update/extract to just /update, and the "Use extracting update handler" should be turned off. The actual field name used for the extracted content body can also be changed, if desired, in the "Schema" part of the configuration. But what is there by default works with Solr as it's set up by default. 
> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector > Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, managed-schema, manifold > settings.docx, manifoldcf.log, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1584. - Resolution: Not A Problem > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What type of regexs does manifold include and exclude support and also in > general regex support? > At the moment i'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude urls that link to documents. > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF > The issue i'm having is that the regex that I have found so far doesn't work > case insensitive, so for every possible case i have to add a new line. > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation what type of regex is able to be used or > maybe a tool to test your regex and see if it is supported by manifold ? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > adress returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Error integrity constraint violation
Hi Kaya, Database constraint violations, as you know, occur because you're trying to put more than one identical value into a table column that requires unique values. For the table in question, if you have the same class name for two different connectors, this would be what you'd expect. Karl On Sun, Feb 17, 2019 at 11:33 PM Kaya Ota wrote: > Hello, folks: > > I am new to ManifoldCF, and trying to make my own connector. > For now, I could successfully build ManifoldCF including my own connector. > However, when I tried to run, I have exceptions. > > The exception I am facing is : > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: integrity > constraint violation: unique constraint or index violation: I1549774667196 > at > > org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.reinterpretException(DBInterfaceHSQLDB.java:734) > at > > org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:754) > at > > org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performInsert(DBInterfaceHSQLDB.java:230) > at > > org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:68) > at > > org.apache.manifoldcf.crawler.connmgr.ConnectorManager.registerConnector(ConnectorManager.java:172) > at > > org.apache.manifoldcf.crawler.system.ManifoldCF.registerConnectors(ManifoldCF.java:672) > at > > org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160) > at > > org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239) > Caused by: java.sql.SQLIntegrityConstraintViolationException: integrity > constraint violation: unique constraint or index violation: I1549774667196 > at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) > at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) > at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown > Source) > at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown > Source) > at > 
org.apache.manifoldcf.core.database.Database.execute(Database.java:916) > at > > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696) > Caused by: org.hsqldb.HsqlException: integrity constraint violation: unique > constraint or index violation: I1549774667196 > at org.hsqldb.error.Error.error(Unknown Source) > at org.hsqldb.error.Error.error(Unknown Source) > at org.hsqldb.index.IndexAVL.insert(Unknown Source) > at org.hsqldb.persist.RowStoreAVL.indexRow(Unknown Source) > at org.hsqldb.persist.RowStoreAVLDisk.indexRow(Unknown Source) > at org.hsqldb.TransactionManagerMVCC.addInsertAction(Unknown > Source) > at org.hsqldb.Session.addInsertAction(Unknown Source) > at org.hsqldb.Table.insertSingleRow(Unknown Source) > at org.hsqldb.StatementDML.insertSingleRow(Unknown Source) > at org.hsqldb.StatementInsert.getResult(Unknown Source) > at org.hsqldb.StatementDMQL.execute(Unknown Source) > at org.hsqldb.Session.executeCompiledStatement(Unknown Source) > at org.hsqldb.Session.execute(Unknown Source) > ... 4 more > > > I am guessing my class-path would have a problem, but do not have a > confidence. > What is the cause of this error? > > I would appreciate for any of your help. > > > Sincerely, > Kaya >
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771462#comment-16771462 ] Karl Wright commented on CONNECTORS-1584: - The mailing list is us...@manifoldcf.apache.org. The regular expressions are standard Java regular expressions. The documentation is widely available. You can also experiment with regular expressions in a java applet online at: https://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What type of regexs does manifold include and exclude support and also in > general regex support? > At the moment i'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude urls that link to documents. > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF > The issue i'm having is that the regex that I have found so far doesn't work > case insensitive, so for every possible case i have to add a new line. > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation what type of regex is able to be used or > maybe a tool to test your regex and see if it is supported by manifold ? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > adress returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
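Since ManifoldCF uses standard Java regular expressions, the case-insensitivity question in this ticket has a one-line answer: the inline `(?i)` flag covers every capitalization of `.pdf`, so the separate `.*.pdf$` / `.*.PDF$` lines are unnecessary (note also that a literal dot should be escaped as `\.`). A small demonstration, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

public class PdfExcludeDemo
{
  // (?i) turns on case-insensitive matching for the whole expression,
  // so one pattern replaces .*.pdf$, .*.PDF$, .*.Pdf$, etc.
  // \\. matches a literal dot rather than "any character".
  static final Pattern PDF = Pattern.compile("(?i).*\\.pdf$");

  static boolean isPdfUrl(String url)
  {
    return PDF.matcher(url).matches();
  }
}
```

Both example URLs from the ticket, `website.com/document/path/this.pdf` and `website.com/document/path/other.PDF`, match this single pattern.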
[jira] [Resolved] (CONNECTORS-1585) MCF Admin page shows 404 error frequently
[ https://issues.apache.org/jira/browse/CONNECTORS-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1585. - Resolution: Cannot Reproduce > MCF Admin page shows 404 error frequently > - > > Key: CONNECTORS-1585 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1585 > Project: ManifoldCF > Issue Type: Task >Reporter: Pavithra Dhakshinamurthy >Priority: Critical > > Hi Team, > I'm getting 404 Page not found error on a frequent basis in Manifold CF home > page. Not able to trace any error logs as well. Please let me know on what > scenarios 404 error will occur. > http://{hostname}:8345/mcf-crawler-ui/login.jsp > Regards, > Pavithra D -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1585) MCF Admin page shows 404 error frequently
[ https://issues.apache.org/jira/browse/CONNECTORS-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771461#comment-16771461 ] Karl Wright commented on CONNECTORS-1585: - 404 errors have nothing to do with ManifoldCF. They have to do with your app server environment -- either that, or your network/proxy. MCF is just a web app and does not have any magic in it. > MCF Admin page shows 404 error frequently > - > > Key: CONNECTORS-1585 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1585 > Project: ManifoldCF > Issue Type: Task >Reporter: Pavithra Dhakshinamurthy >Priority: Critical > > Hi Team, > I'm getting 404 Page not found error on a frequent basis in Manifold CF home > page. Not able to trace any error logs as well. Please let me know on what > scenarios 404 error will occur. > http://{hostname}:8345/mcf-crawler-ui/login.jsp > Regards, > Pavithra D -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1580) Issues in documentum connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1580. - Resolution: Won't Fix > Issues in documentum connector > -- > > Key: CONNECTORS-1580 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1580 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Blocker > Attachments: Job_Scheduling.png > > > Hi Team, > We are facing below issues in apache manifold documentum connector version > 2.9.1.kindly help us. > 1.During the first run of the job,documents are getting indexed to > ElasticSearch.If the same job is run after the completion,records are getting > seeded,processed but not updated to output connector.Once the document id is > indexed,same document id is not able to update it again in the same job. > > 2.We have scheduled incremental crawling for every 15 mins and document > count will vary for every 15 mins. But in seeding it is not resetting the > document count,once the job is completed.It's getting added to last scheduled > job count. >eg.1st schedule-10 documents > 2nd schedule-5 documents > In the 2nd scheduled of the job,the document count should be 5,but it is > having document count as 15. so it is keep on adding the dcouments id for > every schedule and it is processing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766799#comment-16766799 ] Karl Wright commented on CONNECTORS-1580: - You are on your own here. You are trying to use it as a queuing engine, not an incremental indexer. You have not thought this out properly, clearly, because that's not what addSeedDocuments() does. So you must come up with a version string computation that reflects the fact that your documents have changed and need to be reconsidered. It will have to directly reference whatever external queue you are using to stuff changed documents in. You should maybe start by reading the book. It's free. Here: https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs > Issues in documentum connector > -- > > Key: CONNECTORS-1580 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1580 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Blocker > Attachments: Job_Scheduling.png > > > Hi Team, > We are facing below issues in apache manifold documentum connector version > 2.9.1.kindly help us. > 1.During the first run of the job,documents are getting indexed to > ElasticSearch.If the same job is run after the completion,records are getting > seeded,processed but not updated to output connector.Once the document id is > indexed,same document id is not able to update it again in the same job. > > 2.We have scheduled incremental crawling for every 15 mins and document > count will vary for every 15 mins. But in seeding it is not resetting the > document count,once the job is completed.It's getting added to last scheduled > job count. >eg.1st schedule-10 documents > 2nd schedule-5 documents > In the 2nd scheduled of the job,the document count should be 5,but it is > having document count as 15. so it is keep on adding the dcouments id for > every schedule and it is processing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
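The version-string advice above can be illustrated with a small sketch. The inputs (modification date, ACL hash, external queue stamp) are hypothetical stand-ins for whatever attributes should force re-processing when they change; a real connector would compute them from its repository and external queue:

```java
import java.util.Date;

public class VersionStringDemo {
    // Hypothetical helper: builds a version string from the attributes that
    // should trigger reindexing when they change. Field names are invented
    // for illustration, not taken from the Documentum connector.
    static String computeVersionString(Date modified, String aclHash, long externalQueueStamp) {
        StringBuilder sb = new StringBuilder();
        sb.append(modified.getTime()).append('+')
          .append(aclHash).append('+')
          // changes whenever the external queue re-enqueues the document,
          // so the framework sees a new version and re-processes it
          .append(externalQueueStamp);
        return sb.toString();
    }

    public static void main(String[] args) {
        String v1 = computeVersionString(new Date(1000L), "acl-abc", 1);
        String v2 = computeVersionString(new Date(1000L), "acl-abc", 2);
        // Same document, different queue stamps: version strings differ,
        // so the framework would reconsider the document.
        System.out.println(!v1.equals(v2)); // true
    }
}
```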
Re: Apache ManifoldCF: Get a history report for a repository connection over REST API
Yes, query parameters in any URL go after the fixed "path" part of the URL, and are of the form ?parameter=value&parameter2=value2... just like any other URL. My suspicion is that you aren't supplying the activity(s) that you want to match. The best way to figure out what activities make sense for the connection is to look at the report UI page for that connection and see what activities are available. Also please be aware that by default MCF purges history records that are more than 30 days old -- you can configure longer, but if they don't show up in the UI they aren't going to show up in the API. Finally, the reason you don't get an error when you use a connection name that is bogus is that the underlying implementation is merely doing a dumb query and not checking for the legality/existence of the connection name you give it. Thanks, and please let me know how it goes. Karl On Tue, Feb 12, 2019 at 7:43 PM Dave Fisher wrote: > Redirecting your query to dev@manifoldcf.apache.org > > Sent from my iPhone > > > On Feb 12, 2019, at 8:38 AM, Marta Gołąbek wrote: > > > > Dear Sir or Madam, > > > > I'm trying to get a history report for a repository connection over > > ManifoldCF REST API. According to the documentation: > > > > > https://manifoldcf.apache.org/release/release-2.11/en_US/programmatic-operation.html#History+query+parameters > > > > It should be possible with the following URL (connection name: > > myConnection): > > > > > http://localhost:8345/mcf-api-service/json/repositoryconnectionhistory/myConnection > > > > I have also tried to use some of the history query parameters: > > > > > http://localhost:8345/mcf-api-service/json/repositoryconnectionhistory/myConnection?report=simple > > > > But I am not sure if I am using them correctly or how they should be > > attached to the URL, because it is not mentioned in the documentation. The > > problem is also that I don't receive any error, but an empty object, so it > > is difficult to debug. 
The API returns an empty object even for a > > non-existing connection. > > > > However it works for resources, which don't have any attributes, e.g.: > > > > > http://localhost:8345/mcf-api-service/json/repositoryconnectionjobs/myConnection > > > > or > > > > > http://localhost:8345/mcf-api-service/json/repositoryconnections/myConnection > > > > Thanks in advance for any help. > > > > Best regards, > > > > Marta Golabek > >
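The rule that query parameters follow the fixed path part of the URL can be sketched as a small URL builder. The `report=simple` value comes from the question above; the `activity` parameter name is an assumption based on the activities discussion -- check the release documentation for the exact parameter names:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class HistoryUrlDemo {
    // Appends ?name=value&name2=value2... to the fixed "path" part,
    // URL-encoding each value (requires Java 10+ for the Charset overload).
    static String withQuery(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("report", "simple");   // from the question above
        params.put("activity", "fetch");  // assumed parameter/value; verify against the docs
        System.out.println(withQuery(
            "http://localhost:8345/mcf-api-service/json/repositoryconnectionhistory/myConnection",
            params));
        // prints ...repositoryconnectionhistory/myConnection?report=simple&activity=fetch
    }
}
```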
[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup
[ https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766037#comment-16766037 ] Karl Wright commented on CONNECTORS-1581: - It's possible that the problem is due to funkiness in MySQL. We've had a lot of trouble lately because MySQL no longer seems to be properly enforcing transaction integrity in at least some circumstances. OR it could be the open-source MySQL driver we're using; maybe that needs an upgrade? At any rate, removal of jobqueue rows MUST precede removal of job table rows; there's a constraint in place in fact. So if you get to the point where that constraint has been violated, you're pretty certain it's a database issue. :-( > [Set priority thread] Error tossed: null during startup > --- > > Key: CONNECTORS-1581 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1581 > Project: ManifoldCF > Issue Type: Bug > Environment: •ManifoldCF 2.12, running in a Docker Container > based on Redhat Linux, OpenJDK 8 > • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation > utf8_bin)) > • Single Process Setup >Reporter: Markus Schuch >Assignee: Markus Schuch >Priority: Major > > We see the following {{NullPointerException}} at startup: > {code} > [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error > tossed: null > java.lang.NullPointerException > at > org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202) > at > org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141) > {code} > What could be the cause of that? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data
[ https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765959#comment-16765959 ] Karl Wright commented on CONNECTORS-1582: - The purpose is to decide whether the document matches the specified inclusion rules for the document. > Unable to Crawl the Site Contents and Meta-Data > --- > > Key: CONNECTORS-1582 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1582 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy > Assignee: Karl Wright >Priority: Major > > Hi, > Currently I'm using the ManifoldCF(2.9.1) SharePoint version 2003. I'm unable > to crawl the site contents data. I have facing some issues, hard to figure > out to resolve. > can you please assist the same. > There is a method(CheckMatch) for validating ASCII value for site contests > but unable to understand the usage of validation. I'm getting error "no > matching rule" because of failing the rule of CheckMatch(). > Even-though i tried path type as Library, List, Site, Folder but unable to > crawl the site contents and meta data. while putting logger i can able to see > the list of site contents > Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1583) ManifoldCF getting hung frequently
[ https://issues.apache.org/jira/browse/CONNECTORS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765760#comment-16765760 ] Karl Wright commented on CONNECTORS-1583: - How have you deployed ManifoldCF? What app server are you using? What deployment model (e.g. which example)? The ManifoldCF UI runs underneath an application server. It appears to me like that application server is either inaccessible or has been shut down. This is not a ManifoldCF problem. > ManifoldCF getting hung frequently > -- > > Key: CONNECTORS-1583 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1583 > Project: ManifoldCF > Issue Type: Bug >Affects Versions: ManifoldCF 2.9.1 >Reporter: Pavithra Dhakshinamurthy >Priority: Major > Attachments: image-2019-02-12-11-59-52-131.png > > > Hi Team, > We are using Manifold 2.9.1 version for crawling the documents. The > ManifoldCF server is getting hung very frequently due to which crawling is > getting failed. > While accessing the Manifold application, it's throwing 404 error, but we > could see the process running at the background. > !image-2019-02-12-11-59-52-131.png|thumbnail! > Connectors used: > Repository :Documentum > Output : Elasticsearch > Kindly help us in resolving this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1583) ManifoldCF getting hung frequently
[ https://issues.apache.org/jira/browse/CONNECTORS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1583. - Resolution: Incomplete > ManifoldCF getting hung frequently > -- > > Key: CONNECTORS-1583 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1583 > Project: ManifoldCF > Issue Type: Bug >Affects Versions: ManifoldCF 2.9.1 >Reporter: Pavithra Dhakshinamurthy >Priority: Major > Attachments: image-2019-02-12-11-59-52-131.png > > > Hi Team, > We are using Manifold 2.9.1 version for crawling the documents. The > ManifoldCF server is getting hung very frequently due to which crawling is > getting failed. > While accessing the Manifold application, it's throwing 404 error, but we > could see the process running at the background. > !image-2019-02-12-11-59-52-131.png|thumbnail! > Connectors used: > Repository :Documentum > Output : Elasticsearch > Kindly help us in resolving this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup
[ https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765178#comment-16765178 ] Karl Wright commented on CONNECTORS-1581: - Yes if the job ID doesn't show up anywhere it's safe to delete. How did you wind up in that situation though? Karl On Mon, Feb 11, 2019 at 12:15 PM Markus Schuch (JIRA) > [Set priority thread] Error tossed: null during startup > --- > > Key: CONNECTORS-1581 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1581 > Project: ManifoldCF > Issue Type: Bug > Environment: •ManifoldCF 2.12, running in a Docker Container > based on Redhat Linux, OpenJDK 8 > • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation > utf8_bin)) > • Single Process Setup >Reporter: Markus Schuch >Assignee: Markus Schuch >Priority: Major > > We see the following {{NullPointerException}} at startup: > {code} > [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error > tossed: null > java.lang.NullPointerException > at > org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202) > at > org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141) > {code} > What could be the cause of that? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765083#comment-16765083 ] Karl Wright commented on CONNECTORS-1580: - So you modified the Documentum Connector to change what addSeedDocument returns? Did you change what getModel() returns? Did you change how the version string is calculated in processDocuments()? If you don't do that the framework will not detect changes and will not work properly. > Issues in documentum connector > -- > > Key: CONNECTORS-1580 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1580 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Blocker > Attachments: Job_Scheduling.png > > > Hi Team, > We are facing below issues in apache manifold documentum connector version > 2.9.1.kindly help us. > 1.During the first run of the job,documents are getting indexed to > ElasticSearch.If the same job is run after the completion,records are getting > seeded,processed but not updated to output connector.Once the document id is > indexed,same document id is not able to update it again in the same job. > > 2.We have scheduled incremental crawling for every 15 mins and document > count will vary for every 15 mins. But in seeding it is not resetting the > document count,once the job is completed.It's getting added to last scheduled > job count. >eg.1st schedule-10 documents > 2nd schedule-5 documents > In the 2nd scheduled of the job,the document count should be 5,but it is > having document count as 15. so it is keep on adding the dcouments id for > every schedule and it is processing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data
[ https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1582: --- Assignee: Karl Wright > Unable to Crawl the Site Contents and Meta-Data > --- > > Key: CONNECTORS-1582 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1582 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy > Assignee: Karl Wright >Priority: Major > > Hi, > Currently I'm using the ManifoldCF(2.9.1) SharePoint version 2003. I'm unable > to crawl the site contents data. I have facing some issues, hard to figure > out to resolve. > can you please assist the same. > There is a method(CheckMatch) for validating ASCII value for site contests > but unable to understand the usage of validation. I'm getting error "no > matching rule" because of failing the rule of CheckMatch(). > Even-though i tried path type as Library, List, Site, Folder but unable to > crawl the site contents and meta data. while putting logger i can able to see > the list of site contents > Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data
[ https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1582. - Resolution: Not A Problem > Unable to Crawl the Site Contents and Meta-Data > --- > > Key: CONNECTORS-1582 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1582 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy > Assignee: Karl Wright >Priority: Major > > Hi, > Currently I'm using the ManifoldCF(2.9.1) SharePoint version 2003. I'm unable > to crawl the site contents data. I have facing some issues, hard to figure > out to resolve. > can you please assist the same. > There is a method(CheckMatch) for validating ASCII value for site contests > but unable to understand the usage of validation. I'm getting error "no > matching rule" because of failing the rule of CheckMatch(). > Even-though i tried path type as Library, List, Site, Folder but unable to > crawl the site contents and meta data. while putting logger i can able to see > the list of site contents > Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data
[ https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765019#comment-16765019 ] Karl Wright commented on CONNECTORS-1582: - Hi [~Pavithrad], the problem is that you will need not just one rule, but a rule for sites, and a rule for libraries, and a rule for documents. So if the entity you need to decide whether it is included is a site, then you need a site rule, and the same for libraries or documents. And since you can't get to all document metadata without drilling down through sites and libraries, you need the rules for these in order to get to the metadata for each of these levels. The documentation is pretty clear about how these rules work, but I agree that the interface is complex to work with. > Unable to Crawl the Site Contents and Meta-Data > --- > > Key: CONNECTORS-1582 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1582 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > Hi, > Currently I'm using the ManifoldCF(2.9.1) SharePoint version 2003. I'm unable > to crawl the site contents data. I have facing some issues, hard to figure > out to resolve. > can you please assist the same. > There is a method(CheckMatch) for validating ASCII value for site contests > but unable to understand the usage of validation. I'm getting error "no > matching rule" because of failing the rule of CheckMatch(). > Even-though i tried path type as Library, List, Site, Folder but unable to > crawl the site contents and meta data. while putting logger i can able to see > the list of site contents > Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763725#comment-16763725 ] Karl Wright commented on CONNECTORS-1580: - {quote} The documents which have already got indexed are getting processed but not getting updated to Elasticsearch while re-running the same job {quote} What does the Simple History say here? Look for a document that you think should be updated but isn't getting updated. Do you see a document fetch? Do you see a document ingestion? If you see an ingestion BUT the ES index is not getting updated then your problem has to do with how ES is set up. I can imagine quite a few scenarios where that can occur. If you are seeing a fetch but no indexing, that means that the version string for your documentum documents is not changing for some reason. This would require more analysis, starting with learning exactly what has changed with the document in question that you expect should cause a reindex. It is possible you have some custom information that is not showing up in the version string and you are nonetheless expecting it to. We would need more details to be able to fix that. > Issues in documentum connector > -- > > Key: CONNECTORS-1580 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1580 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Blocker > Attachments: Job_Scheduling.png > > > Hi Team, > We are facing below issues in apache manifold documentum connector version > 2.9.1.kindly help us. > 1.During the first run of the job,documents are getting indexed to > ElasticSearch.If the same job is run after the completion,records are getting > seeded,processed but not updated to output connector.Once the document id is > indexed,same document id is not able to update it again in the same job. > > 2.We have scheduled incremental crawling for every 15 mins and document > count will vary for every 15 mins. 
But in seeding it is not resetting the > document count once the job is completed. It's getting added to the last scheduled > job count. > e.g. 1st schedule - 10 documents > 2nd schedule - 5 documents > In the 2nd schedule of the job, the document count should be 5, but it is > showing a document count of 15. So it keeps on adding the document ids for > every schedule and processing them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup
[ https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763667#comment-16763667 ] Karl Wright commented on CONNECTORS-1581: - I am pretty concerned that the database layer is fundamentally not working right. The fact that the set priority thread recovers argues that there is a database failure that silently resolves. This is bizarre. If the thread actually finally starts, then you should be good to go other than the concerns expressed above. > [Set priority thread] Error tossed: null during startup > --- > > Key: CONNECTORS-1581 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1581 > Project: ManifoldCF > Issue Type: Bug > Environment: •ManifoldCF 2.12, running in a Docker Container > based on Redhat Linux, OpenJDK 8 > • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation > utf8_bin)) > • Single Process Setup >Reporter: Markus Schuch >Priority: Major > > We see the following {{NullPointerException}} at startup: > {code} > [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error > tossed: null > java.lang.NullPointerException > at > org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202) > at > org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141) > {code} > What could be the cause of that? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup
[ https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763530#comment-16763530 ] Karl Wright commented on CONNECTORS-1581: - Here's the code that's throwing an NPE: {code} // Compute the list of connector instances we will need. // This has a side effect of fetching all job descriptions too. Set connectionNames = new HashSet(); for (int i = 0; i < descs.length; i++) { DocumentDescription dd = descs[i]; IJobDescription job = jobDescriptionMap.get(dd.getJobID()); if (job == null) { job = jobManager.load(dd.getJobID(),true); jobDescriptionMap.put(dd.getJobID(),job); } connectionNames.add(job.getConnectionName()); } {code} The problem is, apparently, that jobManager.load() is coming back null. I have no idea why this would happen but clearly the problem has to do with the database implementation, perhaps the mysql driver being used? > [Set priority thread] Error tossed: null during startup > --- > > Key: CONNECTORS-1581 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1581 > Project: ManifoldCF > Issue Type: Bug > Environment: •ManifoldCF 2.12, running in a Docker Container > based on Redhat Linux, OpenJDK 8 > • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation > utf8_bin)) > • Single Process Setup >Reporter: Markus Schuch >Priority: Major > > We see the following {{NullPointerException}} at startup: > {code} > [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error > tossed: null > java.lang.NullPointerException > at > org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202) > at > org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141) > {code} > What could be the cause of that? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
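One defensive shape for the quoted loop is to guard against load() returning null instead of dereferencing it. The sketch below uses simplified stand-in types rather than the real ManifoldCF interfaces, purely to show the pattern; it is not the committed fix:

```java
import java.util.HashMap;
import java.util.Map;

public class NullGuardDemo {
    // Simplified stand-in for jobManager.load(): returns null when the job
    // row has disappeared (the situation behind the NPE in the stack trace).
    static final Map<Long, String> JOBS = new HashMap<>();

    static String loadJob(Long jobId) {
        return JOBS.get(jobId);
    }

    // Defensive variant of the quoted loop: skip (and, in real code, log)
    // missing jobs instead of calling a method on a null job description.
    static int countResolvable(Long[] jobIds) {
        int resolved = 0;
        for (Long id : jobIds) {
            String job = loadJob(id);
            if (job == null) {
                continue; // a warning log would go here rather than an NPE
            }
            resolved++;
        }
        return resolved;
    }

    public static void main(String[] args) {
        JOBS.put(1L, "connectionA");
        // Job 2 no longer exists; the guarded loop survives it.
        System.out.println(countResolvable(new Long[]{1L, 2L})); // 1
    }
}
```

Of course, skipping the document only hides the symptom; the underlying database inconsistency described above still needs explaining.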
[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763404#comment-16763404 ] Karl Wright commented on CONNECTORS-1580: - Hi, I can make almost no sense of this ticket. Can you describe the job scheduling setup? Specifically is this "scan once" or "rescan dynamically"? What does this mean exactly? "We have scheduled incremental crawling for every 15 mins" You should be aware that the document count will vary because documents that are discovered are then processed and ManifoldCF may determine during processing that the document does not need to be indexed. The best way to figure out what MCF is doing is to look at the Simple History report and see what is happening. You can see what is fetched and what is reindexed that way. Can you include the Simple History for one incremental job run here, and describe what is wrong with it? > Issues in documentum connector > -- > > Key: CONNECTORS-1580 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1580 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Blocker > > Hi Team, > We are facing below issues in apache manifold documentum connector version > 2.9.1.kindly help us. > 1.During the first run of the job,documents are getting indexed to > ElasticSearch.If the same job is run after the completion,records are getting > seeded,processed but not updated to output connector.Once the document id is > indexed,same document id is not able to update it again in the same job. > > 2.We have scheduled incremental crawling for every 15 mins and document > count will vary for every 15 mins. But in seeding it is not resetting the > document count,once the job is completed.It's getting added to last scheduled > job count. >eg.1st schedule-10 documents > 2nd schedule-5 documents > In the 2nd scheduled of the job,the document count should be 5,but it is > having document count as 15. 
so it keeps on adding the document ids for > every schedule and processing them -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763395#comment-16763395 ] Karl Wright commented on CONNECTORS-1579: - You can either check out the entire current trunk source code and build that, or download the release source and libs, apply the patch, and build that. Which do you want to do? > Error when crawling a MSSQL table > - > > Key: CONNECTORS-1579 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1579 > Project: ManifoldCF > Issue Type: Bug > Components: JDBC connector >Affects Versions: ManifoldCF 2.12 >Reporter: Donald Van den Driessche >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: 636_bb2.csv, CONNECTORS-1579.patch > > > When I'm crawling a MSSQL table through the JDBC connector I get following > error on multiple lines: > > {noformat} > FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple > document primary component dispositions not allowed: document '636' > java.lang.IllegalStateException: Multiple document primary component > dispositions not allowed: document '636' > at > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) > ~[mcf-pull-agent.jar:?] > at > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) > ~[mcf-pull-agent.jar:?] > at > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) > ~[mcf-pull-agent.jar:?] > at > org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) > ~[?:?] > at > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) > [mcf-pull-agent.jar:?]{noformat} > I looked this error up on the internet and it said that it might have > something to do with using the same key for different lines. 
> I checked, but I couldn't find any duplicates that match any of the selected > fields in the JDBC. > Hereby my queries: > Seeding query > {code:java} > SELECT pk1 as $(IDCOLUMN) > FROM dbo.bb2 > WHERE search_url IS NOT NULL > AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', > 'application/xml', 'application/zip'); > {code} > Version check query: none > Access token query: none > Data query: > > > {code:java} > SELECT > pk1 AS $(IDCOLUMN), > search_url AS $(URLCOLUMN), > ISNULL(content, '') AS $(DATACOLUMN), > doc_id, > search_url AS url, > ISNULL(title, '') as title, > ISNULL(groups,'') as groups, > ISNULL(type,'') as document_type, > ISNULL(users, '') as users > FROM dbo.bb2 > WHERE pk1 IN $(IDLIST); > {code} > The hereby added csv is the corresponding line from the table. > [^636_bb2.csv] > > Due to this problem, the whole crawling pipeline is being held up. It keeps > on retrying this line. > Could you help me understand this error? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
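One quick way to test the duplicate-key theory discussed above is to scan the ids a query returns for repeats. A hypothetical diagnostic sketch (not the committed fix, and the example ids are invented around the document '636' from the error):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicateIdCheck {
    // Returns the ids that occur more than once. A "multiple document
    // primary component dispositions" error typically points at the same
    // document id coming back twice from a query.
    static Set<String> duplicates(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>();
        for (String id : ids) {
            if (!seen.add(id)) {
                dups.add(id);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Feed in the $(IDCOLUMN) values returned by the seeding query.
        System.out.println(duplicates(Arrays.asList("635", "636", "636"))); // [636]
    }
}
```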
[jira] [Resolved] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1579.
-
Resolution: Fixed
Fix Version/s: ManifoldCF 2.13
r1853008
[jira] [Updated] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1579:
Attachment: CONNECTORS-1579.patch
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760848#comment-16760848 ] Karl Wright commented on CONNECTORS-1579:
-
It's a bug in the code. Whenever the JDBC connector rejects a document based on what the downstream pipeline tells it to do, it improperly accounts for that and you get this error. The fix is quite simple and I can attach a patch, and will do so shortly. Thanks!
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760818#comment-16760818 ] Karl Wright commented on CONNECTORS-1579:
-
Hi, the proximate cause of the problem is that there are multiple "resolutions" occurring for one document in the JDBC crawl set. When a connector is asked to process a document, it must tell the framework what is to be done with it -- either it gets indexed, or it gets skipped, or it gets deleted. The problem is that the connector is telling the framework TWO things for the same document. The code in question:
{code}
// Now, go through the original id's, and see which ones are still in the map. These
// did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
for (final String documentIdentifier : fetchDocuments) {
  if (!seenDocuments.contains(documentIdentifier)) {
    // Never saw it in the fetch attempt
    activities.deleteDocument(documentIdentifier);
  } else {
    // Saw it in the fetch attempt, and we might have fetched it
    final String documentVersion = map.get(documentIdentifier);
    if (documentVersion != null) {
      // This means we did not see it (or data for it) in the result set. Delete it!
      activities.noDocument(documentIdentifier,documentVersion);
{code}
It's failing on the last line. The connector thinks there is in fact no document that exists (based on the version query you gave it), BUT based on the results of the other queries, it thinks the document does exist (and was in fact processed). I will need to look carefully at the queries and at the connector code to figure out exactly how that can happen, and then I can let you know whether it's a bug in the code or a bug in your queries. Stay tuned.
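The invariant Karl describes -- each document gets exactly one primary disposition per processing pass -- can be sketched in isolation. The following is a hypothetical illustration only, not the actual WorkerThread code; the class and method names are invented for the example:

```java
import java.util.HashSet;
import java.util.Set;

public class DispositionCheck {
    // Tracks which documents have already received a primary disposition.
    private final Set<String> dispositioned = new HashSet<>();

    // A connector reports exactly one outcome (indexed, skipped, or deleted)
    // per document; a second report for the same identifier is rejected,
    // which is the kind of check that produces the IllegalStateException
    // in the stack trace above.
    void record(String documentIdentifier) {
        if (!dispositioned.add(documentIdentifier)) {
            throw new IllegalStateException(
                "Multiple document primary component dispositions not allowed: document '"
                + documentIdentifier + "'");
        }
    }

    public static void main(String[] args) {
        DispositionCheck check = new DispositionCheck();
        check.record("636");      // first disposition: accepted
        try {
            check.record("636");  // second disposition for the same document: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the bug above, the connector effectively called both `deleteDocument` and `noDocument` style dispositions for document '636', tripping exactly this kind of guard.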
[jira] [Assigned] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1579:
---
Assignee: Karl Wright
Re: Apache CXF question
Thanks, Kishore -- but I already have the documentation. What I need is Apache CXF expertise. ;-) Karl

On Fri, Feb 1, 2019 at 2:28 PM Kishore Kumar wrote:
> Hi Karl,
>
> Good morning, I have shared you a Dropbox shared folder with OpenText Content Server Web Service Documentation.
>
> If you have not received the link from Dropbox in your inbox, check in Spam or let me know.
>
> Thanks,
> Kishore Kumar
Apache CXF question
I'm still working on the new OpenText connector, now using Apache CXF to handle the web services piece. I've never worked with this package before, but I've got the WSDLs generating what look like usable Java classes representing the WSDL interfaces. But the underlying transport is mysterious given what is generated. So, two questions:

(1) It doesn't appear to me that explicit generation of classes from the XSD is needed here. It looks like CXF does that too. Am I wrong?
(2) I want the transport to go via an HttpComponents/HttpClient HttpClient object that I create and initialize myself. How can I set that up?

If anyone on this list has a few snippets of code they can share it would be great.

Thanks in advance,
Karl
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757517#comment-16757517 ] Karl Wright commented on CONNECTORS-1564:
-
[~michael-o], we have zero control over whether/when this gets addressed in SolrJ. Previous interactions with the SolrJ developers do not make me feel that a fix would be prompt. But I suggest that [~erlendfg] at least take the step of opening a ticket. We can afford to wait until the next MCF release is imminent before taking any action, but if there's no resolution in sight by then, I think we should implement the workaround for the time being.

> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Lucene/SOLR connector
> Reporter: Erlend Garåsen
> Assignee: Karl Wright
> Priority: Major
> Attachments: CONNECTORS-1564.patch
>
> We should post preemptively in case the Solr server requires basic authentication. This will make the communication between ManifoldCF and Solr much more efficient than the following:
> * Send an HTTP POST request to Solr
> * Solr sends a 401 response
> * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.
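The round-trip savings described in the issue come from attaching the credentials before the server ever challenges. As a minimal, hypothetical sketch using only the JDK (the real connector would do this through HttpClient's credentials-provider/auth-cache machinery rather than a hand-built header):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PreemptiveAuth {
    // Build the Authorization header up front so the very first request
    // carries the credentials, avoiding the 401 challenge round trip.
    static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // "solr"/"secret" are placeholder credentials for illustration.
        System.out.println(basicAuthHeader("solr", "secret"));
    }
}
```

The header produced here is exactly what the server would otherwise demand via its 401 `WWW-Authenticate: Basic` challenge; sending it preemptively halves the number of requests per document post.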
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757137#comment-16757137 ] Karl Wright commented on CONNECTORS-1564:
-
[~erlendfg], if SolrJ is overriding our .setExpectContinue(true), then your workaround is pretty reasonable, and I'd be happy to commit that (as long as you include enough comment so that we can figure out what we were thinking later).
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756074#comment-16756074 ] Karl Wright commented on CONNECTORS-1564:
-
The way you tell it is this:
{code}
request.setProtocolVersion(HttpVersion.HTTP_1_1);
{code}
I suspect there's a similar method in the RequestOptions builder. But I bet one of the things we're doing in the builder is convincing it that it's HTTP 1.0, and that's the problem. We need to figure out what it is.
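For comparison, pinning the protocol version is a one-liner in the JDK 11+ HTTP client as well. This is an illustrative analogue only, not the HttpComponents 4.x code the connector actually uses:

```java
import java.net.http.HttpClient;

public class ForceHttp11 {
    public static void main(String[] args) {
        // Explicitly pin the client to HTTP/1.1 -- the JDK-client analogue of
        // request.setProtocolVersion(HttpVersion.HTTP_1_1) in HttpClient 4.x.
        // With the version pinned, Expect: 100-continue semantics are valid,
        // whereas an HTTP/1.0 peer would never honor them.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        System.out.println(client.version());
    }
}
```

The same idea applies in HttpComponents: if any builder step downgrades the effective protocol to 1.0, expect/continue is silently disabled, which matches the symptom discussed in this thread.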
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756071#comment-16756071 ] Karl Wright commented on CONNECTORS-1564:
-
Oh, and I vaguely recall something -- that since the expect-continue header is for HTTP 1.1 (and not HTTP 1.0), there was code in HttpComponents/HttpClient that disabled it if the client thought it was working in an HTTP 1.0 environment. I wonder if we just need to tell it somehow that it's HTTP 1.1?
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756069#comment-16756069 ] Karl Wright commented on CONNECTORS-1564:
-
[~erlendfg], forcing the header would be a last resort. But we can do it if we must. However, there are about a dozen connectors that rely on this functionality working properly, so I really want to know what is going wrong. Can you experiment with changing the order of the builder method invocations for HttpClient in HttpPoster? It's the only thing I can think of that might be germane. Perhaps if toString() isn't helpful, you can still inspect the property in question. Is there a getter method for useExpectContinue?
Re: About publishing in mvn central repository
There's a ticket outstanding for this but nobody could figure out how to do it, since the jars are built with Ant not Maven. If you want to work out how, please feel free to go ahead. Karl On Wed, Jan 30, 2019 at 7:08 AM Cihad Guzel wrote: > Hi, > > There aren't Manifoldcf jar packages in the mvn central repository. Maybe > they can be published in the repository? So we can add mcf-core or other > mfc jar packages to our projects as dependency. > > What do you think about that? > > [1] > https://maven.apache.org/repository/guide-central-repository-upload.html > > > Regards, > Cihad Güzel >
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755582#comment-16755582 ] Karl Wright commented on CONNECTORS-1564:
-
[~erlendfg], are you in a position to build MCF and experiment with how the HttpClient is constructed in HttpPoster.java? I suspect that what is happening is that the expect/continue is indeed being set, but something that is later done to the builder is turning it back off again. So I would suggest adding a log.debug("httpclientbuilder = "+httpClientBuilder) line in there before we actually use the builder to construct the client, to see if this is the case, and if so, try to figure out which addition is causing the flag to be flipped back.
[jira] [Comment Edited] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1673#comment-1673 ] Karl Wright edited comment on CONNECTORS-1564 at 1/30/19 1:37 AM:
--
[~michael-o] We're using the standard setup code that was recommended by Oleg. If the builders have decent toString() methods, we can dump them to the log when we create the HttpClient object to confirm they are set up correctly. But from the beginning we could see nothing wrong with it. This was the test you said was working:
{code}
HttpClientBuilder builder = HttpClientBuilder.create();
RequestConfig rc = RequestConfig.custom().setExpectContinueEnabled(true).build();
builder.setDefaultRequestConfig(rc);
{code}
We will figure out what winds up canceling out the expect/continue flag, if that's what indeed is happening.

was (Author: kwri...@metacarta.com):
[~michael-o] We're using the standard setup code that was recommended by Oleg. If the builders have decent toString() methods, we can dump them to the log when we create the HttpClient object to confirm they are set up correctly. But from the beginning we could see nothing wrong with it. Can you include the test example here that you used to verify that expect-continue was working?
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1673#comment-1673 ] Karl Wright commented on CONNECTORS-1564:
-
[~michael-o] We're using the standard setup code that was recommended by Oleg. If the builders have decent toString() methods, we can dump them to the log when we create the HttpClient object to confirm they are set up correctly. But from the beginning we could see nothing wrong with it. Can you include the test example here that you used to verify that expect-continue was working?
[jira] [Resolved] (CONNECTORS-1576) Running Multiple Jobs in ManifoldCF
[ https://issues.apache.org/jira/browse/CONNECTORS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1576. - Resolution: Not A Problem > Running Multiple Jobs in ManifoldCF > --- > > Key: CONNECTORS-1576 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1576 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.9.1 >Reporter: Pavithra Dhakshinamurthy >Priority: Major > Labels: features > Fix For: ManifoldCF 2.9.1 > > > Hi, > We have configured two jobs to index documentum contents. when running it in > parallel, seeding is working fine. But only one job processes the document > and pushes to ES. After the first job completes, the second job is processing > the document. > Is this the expected behavior? Or Are we missing anything? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1576) Running Multiple Jobs in ManifoldCF
[ https://issues.apache.org/jira/browse/CONNECTORS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755215#comment-16755215 ] Karl Wright commented on CONNECTORS-1576: - The documents that are already queued at the time the second job starts must all be processed before any documents from the second job are picked up. This is because of how documents are assigned priorities in the database. Once you get past that initial batch of queued documents, both jobs will run simultaneously. > Running Multiple Jobs in ManifoldCF > --- > > Key: CONNECTORS-1576 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1576 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.9.1 >Reporter: Pavithra Dhakshinamurthy >Priority: Major > Labels: features > Fix For: ManifoldCF 2.9.1 > > > Hi, > We have configured two jobs to index Documentum content. When running them in > parallel, seeding works fine, but only one job processes documents > and pushes them to ES. Only after the first job completes does the second job > process documents. > Is this the expected behavior? Or are we missing anything? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755212#comment-16755212 ] Karl Wright commented on CONNECTORS-1564: - [~michael-o], you need to be looking here: {code} https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/solr/connector/src/main/java/org/apache/manifoldcf/agents/output/solr/HttpPoster.java {code} ManifoldCF has its own HttpClient construction. > Support preemptive authentication to Solr connector > --- > > Key: CONNECTORS-1564 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1564 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Reporter: Erlend Garåsen > Assignee: Karl Wright >Priority: Major > Attachments: CONNECTORS-1564.patch > > > We should post preemptively in case the Solr server requires basic > authentication. This will make the communication between ManifoldCF and Solr > much more efficient, avoiding the following exchange: > * Send an HTTP POST request to Solr > * Solr sends a 401 response > * Send the same request, but with an "{{Authorization: Basic}}" header > With preemptive authentication, we can send the header in the first request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1575) inconsistant use of value-labels
[ https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753908#comment-16753908 ] Karl Wright commented on CONNECTORS-1575: - This is because there are two somewhat different internal representations involved. While it is unfortunate that they appear inconsistent, nothing can be done to change them, since doing so would be backwards-incompatible. > inconsistant use of value-labels > - > > Key: CONNECTORS-1575 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1575 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > Attachments: image-2019-01-28-11-57-46-738.png > > > When retrieving a job using the API, there seem to be inconsistencies in the > returned JSON of a job. > For the schedule values 'hourofday', 'minutesofhour', etc. the label of the > value is 'value', while for all other value-labels it is '_value_'. > > !image-2019-01-28-11-57-46-738.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1574) Performance tuning of manifold
[ https://issues.apache.org/jira/browse/CONNECTORS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753913#comment-16753913 ] Karl Wright commented on CONNECTORS-1574: - If you look in the ManifoldCF log, all queries that take more than a minute to execute are logged, along with an EXPLAIN plan. Could you look at your logs, find those queries, and provide their EXPLAIN plans? The quality of the query plans usually depends on the quality of the statistics that the database keeps. When the statistics are out of date, the plans sometimes get horribly bad. ManifoldCF *attempts* to keep up with this by re-analyzing tables after a fixed number of changes, but it can do no better than estimate the number of changes and their effects on the table statistics. So if you are experiencing problems with certain queries, you can set properties.xml values that increase the frequency of analyze operations for the affected table. But first we need to know what's going wrong. > Performance tuning of manifold > -- > > Key: CONNECTORS-1574 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1574 > Project: ManifoldCF > Issue Type: Bug > Components: File system connector, JCIFS connector, Solr 6.x > component >Affects Versions: ManifoldCF 2.5 > Environment: Apache manifold installed in Linux machine > Linux version 3.10.0-327.el7.ppc64le > Red Hat Enterprise Linux Server release 7.2 (Maipo) > Reporter: balaji >Assignee: Karl Wright >Priority: Critical > Labels: performance > > My team is using *Apache ManifoldCF 2.5 with SOLR Cloud* for indexing of > data. We currently have 450-500 jobs which need to run simultaneously. > We need to index JSON data, and we are using the *file system* connector type > along with *postgres* as the backend database. > We are facing several issues: > 1. Scheduling works for some jobs and doesn't work for others. > 2. Some jobs get completed while other jobs hang and never complete. > 3. With one job, 6 documents used to be indexed in 15 minutes, but > now even a directory path with 5 documents takes 20 minutes or sometimes > never completes. > 4. The "list all jobs" or "status and job management" page sometimes doesn't load, > and in pg_stat_activity we observe that 2 queries are in a waiting > state, because of which the page doesn't load; if we kill those > queries or restart manifold, the issue gets resolved and the page loads > properly. > Queries getting stuck: > 1. SELECT ID,FAILTIME, FAILCOUNT, SEEDINGVERSION, STATUS FROM JOBS WHERE > (STATUS=$1 OR STATUS=$2) FOR UPDATE > 2. UPDATE JOBS SET ERRORTEXT=NULL, ENDTIME=NULL, WINDOWEND=NULL, STATUS=$1 > WHERE ID=$2 > Note: we have deployed manifold on *linux*. Our major requirement is > scheduling of jobs which will run every 15 minutes. > Please help us in fine-tuning manifold so that it runs smoothly and acts as a > robust system. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
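Karl's suggestion about increasing analyze frequency corresponds to a properties.xml entry of roughly the following shape. The property name pattern, table name, and threshold below are assumptions drawn from the ManifoldCF performance-tuning documentation; check the documentation for your ManifoldCF version before relying on them:

```xml
<!-- Illustrative sketch only: re-analyze a heavily churned table after fewer
     changes than the default. The exact property name is an assumption. -->
<property name="org.apache.manifoldcf.db.postgres.analyze.jobqueue" value="5000"/>
```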
[jira] [Assigned] (CONNECTORS-1574) Performance tuning of manifold
[ https://issues.apache.org/jira/browse/CONNECTORS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1574: --- Assignee: Karl Wright > Performance tuning of manifold > -- > > Key: CONNECTORS-1574 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1574 > Project: ManifoldCF > Issue Type: Bug > Components: File system connector, JCIFS connector, Solr 6.x > component >Affects Versions: ManifoldCF 2.5 > Environment: Apache manifold installed in Linux machine > Linux version 3.10.0-327.el7.ppc64le > Red Hat Enterprise Linux Server release 7.2 (Maipo) > Reporter: balaji >Assignee: Karl Wright >Priority: Critical > Labels: performance > > My team is using *Apache ManifoldCF 2.5 with SOLR Cloud* for indexing of > data. We currently have 450-500 jobs which need to run simultaneously. > We need to index JSON data, and we are using the *file system* connector type > along with *postgres* as the backend database. > We are facing several issues: > 1. Scheduling works for some jobs and doesn't work for others. > 2. Some jobs get completed while other jobs hang and never complete. > 3. With one job, 6 documents used to be indexed in 15 minutes, but > now even a directory path with 5 documents takes 20 minutes or sometimes > never completes. > 4. The "list all jobs" or "status and job management" page sometimes doesn't load, > and in pg_stat_activity we observe that 2 queries are in a waiting > state, because of which the page doesn't load; if we kill those > queries or restart manifold, the issue gets resolved and the page loads > properly. > Queries getting stuck: > 1. SELECT ID,FAILTIME, FAILCOUNT, SEEDINGVERSION, STATUS FROM JOBS WHERE > (STATUS=$1 OR STATUS=$2) FOR UPDATE > 2. UPDATE JOBS SET ERRORTEXT=NULL, ENDTIME=NULL, WINDOWEND=NULL, STATUS=$1 > WHERE ID=$2 > Note: we have deployed manifold on *linux*. Our major requirement is > scheduling of jobs which will run every 15 minutes. > Please help us in fine-tuning manifold so that it runs smoothly and acts as a > robust system. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1575) inconsistant use of value-labels
[ https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1575. - Resolution: Won't Fix > inconsistant use of value-labels > - > > Key: CONNECTORS-1575 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1575 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > Attachments: image-2019-01-28-11-57-46-738.png > > > When retrieving a job using the API, there seem to be inconsistencies in the > returned JSON of a job. > For the schedule values 'hourofday', 'minutesofhour', etc. the label of the > value is 'value', while for all other value-labels it is '_value_'. > > !image-2019-01-28-11-57-46-738.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Mambo CMS
We do not have Mambo connectors in MCF. I don't know anything about CMIS support in that offering either. Karl On Sat, Jan 26, 2019 at 8:17 AM Furkan KAMACI wrote: > Hi All, > > Mambo (http://mambo-foundation.org) is an open-source CMS which is > used by many companies. > > Do we have a Mambo integration via ManifoldCF, or does anybody know whether Mambo > supports our CMIS connector? > > If not, we can suggest it as a GSoC project for 2019. > > Kind Regards, > Furkan KAMACI >
Re: Axis question
I was able to get the wsdl->java compilation working without downloading a ton of additional dependencies, and with cxf version 2.6.2. Thanks, Rafa, for your help in getting this far. Karl On Fri, Jan 25, 2019 at 4:11 PM Karl Wright wrote: > That's one approach. I'm not thrilled with it; we cannot guarantee no > client wsdl changes over time. But if there's nothing better we'll have to > live with it. > > The real problem, of course, is that code generated with version X of cxf > requires runtime libraries from version X, and that's still a conflict. So > I need to get the WSDL2Java going for 2.6.2. > > Karl > > > On Fri, Jan 25, 2019 at 3:54 PM Rafa Haro wrote: > >> I would try to be pragmatic. If those wsdl are not likely to change in the >> future, I would build the client classes offline. Not sure if the >> generated >> class are going to use further classes of cxf and then the problem could >> end up being the same, but it is worth to try >> >> El El vie, 25 ene 2019 a las 21:14, Karl Wright >> escribió: >> >> > I downloaded the cxf binary, latest version. >> > The dependency list is huge and very likely conflicts with existing >> > connectors which have dependencies on cxf 2.x. I would estimate that >> > including all the new jars and dependencies would easily double our >> > download footprint. >> > >> > Surely there must be a list of the minimal jars needed to get >> WSDLToJava to >> > function somewhere? >> > >> > Karl >> > >> > >> > >> > >> > On Fri, Jan 25, 2019 at 2:14 PM Karl Wright wrote: >> > >> > > I'm not getting missing cxf jars. I'm getting problems with >> downstream >> > > dependencies. >> > > >> > > We don't usually ship more jars than we need to, is the short answer >> to >> > > your second question. >> > > >> > > Karl >> > > >> > > >> > > On Fri, Jan 25, 2019 at 11:38 AM Rafa Haro wrote: >> > > >> > >> which jars are you downloading?. Why not getting the whole release? 
>> > >> >> > >> On Fri, Jan 25, 2019 at 5:31 PM Rafa Haro wrote: >> > >> >> > >>> Not sure, Karl I just picked up last release. I can try to find the >> > >>> first version offering it but as long as they have backwards >> > compatibility >> > >>> we should be fine with the last version although we might need to >> > update >> > >>> the affected connectors >> > >>> >> > >>> Rafa >> > >>> >> > >>> On Fri, Jan 25, 2019 at 3:53 PM Karl Wright >> > wrote: >> > >>> >> > >>>> When did it first appear? We're currently on 2.6.2; this is set by >> > >>>> various dependencies by our connectors. >> > >>>> >> > >>>> Karl >> > >>>> >> > >>>> On Fri, Jan 25, 2019 at 9:52 AM Karl Wright >> > wrote: >> > >>>> >> > >>>>> The tools package doesn't seem to have it either. >> > >>>>> Karl >> > >>>>> >> > >>>>> >> > >>>>> On Fri, Jan 25, 2019 at 9:43 AM Karl Wright >> > >>>>> wrote: >> > >>>>> >> > >>>>>> Do you know what jar/maven package this is in? because I don't >> seem >> > >>>>>> to have it in our normal cxf jars... >> > >>>>>> >> > >>>>>> Karl >> > >>>>>> >> > >>>>>> >> > >>>>>> On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro >> wrote: >> > >>>>>> >> > >>>>>>> I used a wsdl2java script that comes as an utility of the apache >> > cxf >> > >>>>>>> release, but basically is making use >> > >>>>>>> of org.apache.cxf.tools.wsdlto.WSDLToJava class. You can find >> here >> > >>>>>>> an usage >> > >>>>>>> example with ant: http://cxf.apache.org/docs/wsdl-to-java.html >> > >>>>>>> >> > >>>>>>> On Fri, Jan 25, 2019 at 2:59 PM Karl Wright > > >> > >>>>>>> wrote: >> > >>>>>
Re: Axis question
That's one approach. I'm not thrilled with it; we cannot guarantee no client wsdl changes over time. But if there's nothing better we'll have to live with it. The real problem, of course, is that code generated with version X of cxf requires runtime libraries from version X, and that's still a conflict. So I need to get the WSDL2Java going for 2.6.2. Karl On Fri, Jan 25, 2019 at 3:54 PM Rafa Haro wrote: > I would try to be pragmatic. If those wsdl are not likely to change in the > future, I would build the client classes offline. Not sure if the generated > class are going to use further classes of cxf and then the problem could > end up being the same, but it is worth to try > > El El vie, 25 ene 2019 a las 21:14, Karl Wright > escribió: > > > I downloaded the cxf binary, latest version. > > The dependency list is huge and very likely conflicts with existing > > connectors which have dependencies on cxf 2.x. I would estimate that > > including all the new jars and dependencies would easily double our > > download footprint. > > > > Surely there must be a list of the minimal jars needed to get WSDLToJava > to > > function somewhere? > > > > Karl > > > > > > > > > > On Fri, Jan 25, 2019 at 2:14 PM Karl Wright wrote: > > > > > I'm not getting missing cxf jars. I'm getting problems with downstream > > > dependencies. > > > > > > We don't usually ship more jars than we need to, is the short answer to > > > your second question. > > > > > > Karl > > > > > > > > > On Fri, Jan 25, 2019 at 11:38 AM Rafa Haro wrote: > > > > > >> which jars are you downloading?. Why not getting the whole release? > > >> > > >> On Fri, Jan 25, 2019 at 5:31 PM Rafa Haro wrote: > > >> > > >>> Not sure, Karl I just picked up last release. 
I can try to find the > > >>> first version offering it but as long as they have backwards > > compatibility > > >>> we should be fine with the last version although we might need to > > update > > >>> the affected connectors > > >>> > > >>> Rafa > > >>> > > >>> On Fri, Jan 25, 2019 at 3:53 PM Karl Wright > > wrote: > > >>> > > >>>> When did it first appear? We're currently on 2.6.2; this is set by > > >>>> various dependencies by our connectors. > > >>>> > > >>>> Karl > > >>>> > > >>>> On Fri, Jan 25, 2019 at 9:52 AM Karl Wright > > wrote: > > >>>> > > >>>>> The tools package doesn't seem to have it either. > > >>>>> Karl > > >>>>> > > >>>>> > > >>>>> On Fri, Jan 25, 2019 at 9:43 AM Karl Wright > > >>>>> wrote: > > >>>>> > > >>>>>> Do you know what jar/maven package this is in? because I don't > seem > > >>>>>> to have it in our normal cxf jars... > > >>>>>> > > >>>>>> Karl > > >>>>>> > > >>>>>> > > >>>>>> On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro > wrote: > > >>>>>> > > >>>>>>> I used a wsdl2java script that comes as an utility of the apache > > cxf > > >>>>>>> release, but basically is making use > > >>>>>>> of org.apache.cxf.tools.wsdlto.WSDLToJava class. You can find > here > > >>>>>>> an usage > > >>>>>>> example with ant: http://cxf.apache.org/docs/wsdl-to-java.html > > >>>>>>> > > >>>>>>> On Fri, Jan 25, 2019 at 2:59 PM Karl Wright > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>> > I was using ancient Axis 1.4 and none of them were working. > You > > >>>>>>> can > > >>>>>>> > exercise this with "ant classcreate-wsdls" in the csws > directory. > > >>>>>>> > > > >>>>>>> > If you can give instructions for invoking CXF, maybe we can do > > that > > >>>>>>> > instead. What's the main class, and what jars do we need to > > >>>>>>> include? > > >>>>>>> > > > >>>>>>> >
Re: Axis question
I downloaded the cxf binary, latest version. The dependency list is huge and very likely conflicts with existing connectors which have dependencies on cxf 2.x. I would estimate that including all the new jars and dependencies would easily double our download footprint. Surely there must be a list of the minimal jars needed to get WSDLToJava to function somewhere? Karl On Fri, Jan 25, 2019 at 2:14 PM Karl Wright wrote: > I'm not getting missing cxf jars. I'm getting problems with downstream > dependencies. > > We don't usually ship more jars than we need to, is the short answer to > your second question. > > Karl > > > On Fri, Jan 25, 2019 at 11:38 AM Rafa Haro wrote: > >> which jars are you downloading?. Why not getting the whole release? >> >> On Fri, Jan 25, 2019 at 5:31 PM Rafa Haro wrote: >> >>> Not sure, Karl I just picked up last release. I can try to find the >>> first version offering it but as long as they have backwards compatibility >>> we should be fine with the last version although we might need to update >>> the affected connectors >>> >>> Rafa >>> >>> On Fri, Jan 25, 2019 at 3:53 PM Karl Wright wrote: >>> >>>> When did it first appear? We're currently on 2.6.2; this is set by >>>> various dependencies by our connectors. >>>> >>>> Karl >>>> >>>> On Fri, Jan 25, 2019 at 9:52 AM Karl Wright wrote: >>>> >>>>> The tools package doesn't seem to have it either. >>>>> Karl >>>>> >>>>> >>>>> On Fri, Jan 25, 2019 at 9:43 AM Karl Wright >>>>> wrote: >>>>> >>>>>> Do you know what jar/maven package this is in? because I don't seem >>>>>> to have it in our normal cxf jars... >>>>>> >>>>>> Karl >>>>>> >>>>>> >>>>>> On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro wrote: >>>>>> >>>>>>> I used a wsdl2java script that comes as an utility of the apache cxf >>>>>>> release, but basically is making use >>>>>>> of org.apache.cxf.tools.wsdlto.WSDLToJava class. 
You can find here >>>>>>> an usage >>>>>>> example with ant: http://cxf.apache.org/docs/wsdl-to-java.html >>>>>>> >>>>>>> On Fri, Jan 25, 2019 at 2:59 PM Karl Wright >>>>>>> wrote: >>>>>>> >>>>>>> > I was using ancient Axis 1.4 and none of them were working. You >>>>>>> can >>>>>>> > exercise this with "ant classcreate-wsdls" in the csws directory. >>>>>>> > >>>>>>> > If you can give instructions for invoking CXF, maybe we can do that >>>>>>> > instead. What's the main class, and what jars do we need to >>>>>>> include? >>>>>>> > >>>>>>> > Karl >>>>>>> > >>>>>>> > >>>>>>> > On Fri, Jan 25, 2019 at 7:28 AM Rafa Haro >>>>>>> wrote: >>>>>>> > >>>>>>> >> Yes, I did. I have only tested Authentication service with Apache >>>>>>> CXF and >>>>>>> >> it was apparently working fine. Which ones were failing for you? >>>>>>> >> >>>>>>> >> On Fri, Jan 25, 2019 at 12:38 PM Karl Wright >>>>>>> wrote: >>>>>>> >> >>>>>>> >>> Were you able to look at this yesterday at all? >>>>>>> >>> Karl >>>>>>> >>> >>>>>>> >>> On Thu, Jan 24, 2019 at 6:34 AM Karl Wright >>>>>>> wrote: >>>>>>> >>> >>>>>>> >>>> They're all checked in. >>>>>>> >>>> >>>>>>> >>>> See >>>>>>> >>>> >>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1566/connectors/csws/wsdls >>>>>>> >>>> >>>>>>> >>>> Karl >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> On Thu, Jan 24, 2019 at 6:24 AM Rafa Haro >>>>>>> wrote: >>>>>>> >>>> >>>>>>> >>>>> Karl, can you share the WSDL, I can try to take a look later >>>>>>> today >>>>>>> >>>>> >>>>>>> >>>>> On Thu, Jan 24, 2019 at 12:13 PM Karl Wright < >>>>>>> daddy...@gmail.com> >>>>>>> >>>>> wrote: >>>>>>> >>>>> >>>>>>> >>>>> > I'm redeveloping the Livelink connector because the API code >>>>>>> has been >>>>>>> >>>>> > discontinued and the only API is now web services based. >>>>>>> The WSDLs >>>>>>> >>>>> and >>>>>>> >>>>> > XSDs have been exported and I'm trying to use the Axis tool >>>>>>> >>>>> WSDL2Java to >>>>>>> >>>>> > convert to Java code. 
Unfortunately, I haven't been able to >>>>>>> make >>>>>>> >>>>> this work >>>>>>> >>>>> > -- even though the WSDLs references have been made local and >>>>>>> the >>>>>>> >>>>> XSDs also >>>>>>> >>>>> > seem to be getting parsed, it complains about missing >>>>>>> definitions, >>>>>>> >>>>> even >>>>>>> >>>>> > though those definitions are clearly present in the XSD >>>>>>> files. >>>>>>> >>>>> > >>>>>>> >>>>> > Has anyone had enough experience with this tool, and web >>>>>>> services in >>>>>>> >>>>> > general, to figure out what's wrong? I've tried turning on >>>>>>> as >>>>>>> >>>>> verbose a >>>>>>> >>>>> > debugging level for WSDL2Java as I can and it's no help at >>>>>>> all. I >>>>>>> >>>>> suspect >>>>>>> >>>>> > namespace issues but I can't figure out what they are. >>>>>>> >>>>> > >>>>>>> >>>>> > Thanks in advance, >>>>>>> >>>>> > Karl >>>>>>> >>>>> > >>>>>>> >>>>> >>>>>>> >>>> >>>>>>> >>>>>>
Re: Axis question
I'm not getting missing cxf jars. I'm getting problems with downstream dependencies. We don't usually ship more jars than we need to, is the short answer to your second question. Karl On Fri, Jan 25, 2019 at 11:38 AM Rafa Haro wrote: > which jars are you downloading?. Why not getting the whole release? > > On Fri, Jan 25, 2019 at 5:31 PM Rafa Haro wrote: > >> Not sure, Karl I just picked up last release. I can try to find the first >> version offering it but as long as they have backwards compatibility we >> should be fine with the last version although we might need to update the >> affected connectors >> >> Rafa >> >> On Fri, Jan 25, 2019 at 3:53 PM Karl Wright wrote: >> >>> When did it first appear? We're currently on 2.6.2; this is set by >>> various dependencies by our connectors. >>> >>> Karl >>> >>> On Fri, Jan 25, 2019 at 9:52 AM Karl Wright wrote: >>> >>>> The tools package doesn't seem to have it either. >>>> Karl >>>> >>>> >>>> On Fri, Jan 25, 2019 at 9:43 AM Karl Wright wrote: >>>> >>>>> Do you know what jar/maven package this is in? because I don't seem >>>>> to have it in our normal cxf jars... >>>>> >>>>> Karl >>>>> >>>>> >>>>> On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro wrote: >>>>> >>>>>> I used a wsdl2java script that comes as an utility of the apache cxf >>>>>> release, but basically is making use >>>>>> of org.apache.cxf.tools.wsdlto.WSDLToJava class. You can find here an >>>>>> usage >>>>>> example with ant: http://cxf.apache.org/docs/wsdl-to-java.html >>>>>> >>>>>> On Fri, Jan 25, 2019 at 2:59 PM Karl Wright >>>>>> wrote: >>>>>> >>>>>> > I was using ancient Axis 1.4 and none of them were working. You can >>>>>> > exercise this with "ant classcreate-wsdls" in the csws directory. >>>>>> > >>>>>> > If you can give instructions for invoking CXF, maybe we can do that >>>>>> > instead. What's the main class, and what jars do we need to >>>>>> include? 
>>>>>> > >>>>>> > Karl >>>>>> > >>>>>> > >>>>>> > On Fri, Jan 25, 2019 at 7:28 AM Rafa Haro wrote: >>>>>> > >>>>>> >> Yes, I did. I have only tested Authentication service with Apache >>>>>> CXF and >>>>>> >> it was apparently working fine. Which ones were failing for you? >>>>>> >> >>>>>> >> On Fri, Jan 25, 2019 at 12:38 PM Karl Wright >>>>>> wrote: >>>>>> >> >>>>>> >>> Were you able to look at this yesterday at all? >>>>>> >>> Karl >>>>>> >>> >>>>>> >>> On Thu, Jan 24, 2019 at 6:34 AM Karl Wright >>>>>> wrote: >>>>>> >>> >>>>>> >>>> They're all checked in. >>>>>> >>>> >>>>>> >>>> See >>>>>> >>>> >>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1566/connectors/csws/wsdls >>>>>> >>>> >>>>>> >>>> Karl >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> On Thu, Jan 24, 2019 at 6:24 AM Rafa Haro >>>>>> wrote: >>>>>> >>>> >>>>>> >>>>> Karl, can you share the WSDL, I can try to take a look later >>>>>> today >>>>>> >>>>> >>>>>> >>>>> On Thu, Jan 24, 2019 at 12:13 PM Karl Wright < >>>>>> daddy...@gmail.com> >>>>>> >>>>> wrote: >>>>>> >>>>> >>>>>> >>>>> > I'm redeveloping the Livelink connector because the API code >>>>>> has been >>>>>> >>>>> > discontinued and the only API is now web services based. The >>>>>> WSDLs >>>>>> >>>>> and >>>>>> >>>>> > XSDs have been exported and I'm trying to use the Axis tool >>>>>> >>>>> WSDL2Java to >>>>>> >>>>> > convert to Java code. Unfortunately, I haven't been able to >>>>>> make >>>>>> >>>>> this work >>>>>> >>>>> > -- even though the WSDLs references have been made local and >>>>>> the >>>>>> >>>>> XSDs also >>>>>> >>>>> > seem to be getting parsed, it complains about missing >>>>>> definitions, >>>>>> >>>>> even >>>>>> >>>>> > though those definitions are clearly present in the XSD files. >>>>>> >>>>> > >>>>>> >>>>> > Has anyone had enough experience with this tool, and web >>>>>> services in >>>>>> >>>>> > general, to figure out what's wrong? 
I've tried turning on as >>>>>> >>>>> verbose a >>>>>> >>>>> > debugging level for WSDL2Java as I can and it's no help at >>>>>> all. I >>>>>> >>>>> suspect >>>>>> >>>>> > namespace issues but I can't figure out what they are. >>>>>> >>>>> > >>>>>> >>>>> > Thanks in advance, >>>>>> >>>>> > Karl >>>>>> >>>>> > >>>>>> >>>>> >>>>>> >>>> >>>>>> >>>>>
Re: Axis question
I've been fighting with this pretty hard for a couple of hours now. I did find the proper cxf tools jar eventually but I'm getting one dependency problem after another. Currently I have: >>>>>> classcreate-wsdl-cxf: [mkdir] Created dir: /mnt/c/wip/mcf/CONNECTORS-1566/connectors/csws/build/wsdljava [java] Jan 25, 2019 4:09:02 PM org.apache.cxf.staxutils.StaxUtils createXMLInputFactory [java] WARNING: Could not create a secure Stax XMLInputFactory. Found class com.sun.xml.internal.stream.XMLInputFactoryImpl. Suggest Woodstox 4.2.0 or newer. [java] Jan 25, 2019 4:09:03 PM org.apache.cxf.staxutils.StaxUtils createXMLInputFactory [java] WARNING: Could not create a secure Stax XMLInputFactory. Found class com.sun.xml.internal.stream.XMLInputFactoryImpl. Suggest Woodstox 4.2.0 or newer. [java] [java] WSDLToJava Error: Could not find jaxws frontend within classpath [java] <<<<<< ... even though I have jaxws* in the path and woodstox 5.7 too. Rafa, can you tell me what classpath you are using and what the full dependencies are for this tool? Karl On Fri, Jan 25, 2019 at 9:53 AM Karl Wright wrote: > When did it first appear? We're currently on 2.6.2; this is set by > various dependencies by our connectors. > > Karl > > On Fri, Jan 25, 2019 at 9:52 AM Karl Wright wrote: > >> The tools package doesn't seem to have it either. >> Karl >> >> >> On Fri, Jan 25, 2019 at 9:43 AM Karl Wright wrote: >> >>> Do you know what jar/maven package this is in? because I don't seem to >>> have it in our normal cxf jars... >>> >>> Karl >>> >>> >>> On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro wrote: >>> >>>> I used a wsdl2java script that comes as an utility of the apache cxf >>>> release, but basically is making use >>>> of org.apache.cxf.tools.wsdlto.WSDLToJava class. You can find here an >>>> usage >>>> example with ant: http://cxf.apache.org/docs/wsdl-to-java.html >>>> >>>> On Fri, Jan 25, 2019 at 2:59 PM Karl Wright wrote: >>>> >>>> > I was using ancient Axis 1.4 and none of them were working. 
You can >>>> > exercise this with "ant classcreate-wsdls" in the csws directory. >>>> > >>>> > If you can give instructions for invoking CXF, maybe we can do that >>>> > instead. What's the main class, and what jars do we need to include? >>>> > >>>> > Karl >>>> > >>>> > >>>> > On Fri, Jan 25, 2019 at 7:28 AM Rafa Haro wrote: >>>> > >>>> >> Yes, I did. I have only tested Authentication service with Apache >>>> CXF and >>>> >> it was apparently working fine. Which ones were failing for you? >>>> >> >>>> >> On Fri, Jan 25, 2019 at 12:38 PM Karl Wright >>>> wrote: >>>> >> >>>> >>> Were you able to look at this yesterday at all? >>>> >>> Karl >>>> >>> >>>> >>> On Thu, Jan 24, 2019 at 6:34 AM Karl Wright >>>> wrote: >>>> >>> >>>> >>>> They're all checked in. >>>> >>>> >>>> >>>> See >>>> >>>> >>>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1566/connectors/csws/wsdls >>>> >>>> >>>> >>>> Karl >>>> >>>> >>>> >>>> >>>> >>>> On Thu, Jan 24, 2019 at 6:24 AM Rafa Haro >>>> wrote: >>>> >>>> >>>> >>>>> Karl, can you share the WSDL, I can try to take a look later today >>>> >>>>> >>>> >>>>> On Thu, Jan 24, 2019 at 12:13 PM Karl Wright >>>> >>>>> wrote: >>>> >>>>> >>>> >>>>> > I'm redeveloping the Livelink connector because the API code >>>> has been >>>> >>>>> > discontinued and the only API is now web services based. The >>>> WSDLs >>>> >>>>> and >>>> >>>>> > XSDs have been exported and I'm trying to use the Axis tool >>>> >>>>> WSDL2Java to >>>> >>>>> > convert to Java code. Unfortunately, I haven't been able to >>>> make >>>> >>>>> this work >>>> >>>>> > -- even though the WSDLs references have been made local and the >>>> >>>>> XSDs also >>>> >>>>> > seem to be getting parsed, it complains about missing >>>> definitions, >>>> >>>>> even >>>> >>>>> > though those definitions are clearly present in the XSD files. 
>>>> >>>>> > >>>> >>>>> > Has anyone had enough experience with this tool, and web >>>> services in >>>> >>>>> > general, to figure out what's wrong? I've tried turning on as >>>> >>>>> verbose a >>>> >>>>> > debugging level for WSDL2Java as I can and it's no help at >>>> all. I >>>> >>>>> suspect >>>> >>>>> > namespace issues but I can't figure out what they are. >>>> >>>>> > >>>> >>>>> > Thanks in advance, >>>> >>>>> > Karl >>>> >>>>> > >>>> >>>>> >>>> >>>> >>>> >>>
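The ant pattern from the cxf wsdl-to-java page referenced in this thread looks roughly like the sketch below. The directory and WSDL file names are illustrative, and the classpath contents are an assumption: it must include the cxf tools jars plus the JAX-WS frontend jar that the "Could not find jaxws frontend" error above complains about (cxf-tools-wsdlto-frontend-jaxws), along with a Stax implementation such as Woodstox:

```xml
<java classname="org.apache.cxf.tools.wsdlto.WSDLToJava" fork="true" failonerror="true">
  <arg value="-d"/>
  <arg value="build/wsdljava"/>
  <arg value="wsdls/Authentication.wsdl"/>
  <classpath>
    <!-- Needs cxf-tools-common, cxf-tools-wsdlto-*, the jaxws frontend jar,
         and Woodstox; the exact jar list is an assumption -->
    <fileset dir="lib">
      <include name="*.jar"/>
    </fileset>
  </classpath>
</java>
```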
Re: Axis question
When did it first appear? We're currently on 2.6.2; this version is set by various dependencies of our connectors.

Karl
Re: Axis question
The tools package doesn't seem to have it either.

Karl
Re: Axis question
Do you know what jar/maven package this is in? Because I don't seem to have it in our normal cxf jars...

Karl

On Fri, Jan 25, 2019 at 9:08 AM Rafa Haro wrote:
> I used a wsdl2java script that comes as a utility of the Apache CXF
> release, but basically it makes use of the
> org.apache.cxf.tools.wsdlto.WSDLToJava class. You can find a usage
> example with ant here: http://cxf.apache.org/docs/wsdl-to-java.html
Re: Axis question
The cxf stuff is already present, and is available in connector-common-lib as well, so all that might be needed is a new ant rule to invoke it:

01/17/2019 05:47 PM  1,400,339 cxf-core-3.2.6.jar
01/17/2019 05:46 PM    181,690 cxf-rt-bindings-soap-3.2.6.jar
01/17/2019 05:46 PM     38,307 cxf-rt-bindings-xml-3.2.6.jar
01/17/2019 05:46 PM    105,048 cxf-rt-databinding-jaxb-3.2.6.jar
01/17/2019 05:47 PM    680,120 cxf-rt-frontend-jaxrs-3.2.6.jar
01/17/2019 05:46 PM    346,308 cxf-rt-frontend-jaxws-3.2.6.jar
01/17/2019 05:46 PM    103,850 cxf-rt-frontend-simple-3.2.6.jar
01/17/2019 05:47 PM    179,790 cxf-rt-rs-client-3.2.6.jar
01/17/2019 05:47 PM    362,532 cxf-rt-transports-http-3.2.6.jar
01/17/2019 05:46 PM     75,478 cxf-rt-ws-addr-3.2.6.jar
01/17/2019 05:46 PM    214,507 cxf-rt-ws-policy-3.2.6.jar
01/17/2019 05:46 PM    173,359 cxf-rt-wsdl-3.2.6.jar

We'd also need XSD code generation, which is currently done by Castor (haven't even tried it yet), so if this package has that ability too, it would be fantastic.

Karl
Re: Axis question
I was using ancient Axis 1.4 and none of them were working. You can exercise this with "ant classcreate-wsdls" in the csws directory.

If you can give instructions for invoking CXF, maybe we can do that instead. What's the main class, and what jars do we need to include?

Karl

On Fri, Jan 25, 2019 at 7:28 AM Rafa Haro wrote:
> Yes, I did. I have only tested Authentication service with Apache CXF and
> it was apparently working fine. Which ones were failing for you?
Re: Axis question
Were you able to look at this yesterday at all?
Karl
[jira] [Resolved] (CONNECTORS-1573) Web Crawler exclude from index matches too much?
[ https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1573. - Resolution: Not A Problem

> Web Crawler exclude from index matches too much?
>
>                 Key: CONNECTORS-1573
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Korneel Staelens
>            Priority: Major
>
> Hello,
> I'm not sure whether this is a bug or my misinterpretation of the exclusion
> rules. I want to set up a rule so that it does NOT index a parent page, but
> does index all child pages of that parent.
> I'm setting up a rule:
> Inclusions:
> .*
>
> Exclusions:
> [http://www.website.com/nl/]
> (I've also tried: http://www.website.com/nl/(\s)* )
> No dice. If I'm looking at the logs, I see the pages are crawled, but not
> indexed due to job restriction. Is my rule wrong? Or is this a small bug?
>
> Thanks for advice!

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1573) Web Crawler exclude from index matches too much?
[ https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751689#comment-16751689 ] Karl Wright commented on CONNECTORS-1573: - Questions like this should be asked on the us...@manifoldcf.apache.org list, not via a ticket.

The quick answer: if you look at the Simple History, you can tell whether the pages are fetched or not. If they are not fetched at all (that is, they do not appear), then your inclusion and exclusion list is wrong. That doesn't sound like the problem here; it sounds like it's being blocked *after* fetching. There are a number of reasons for that; the Simple History should give you a good idea which one it is. If it reports "JOBDESCRIPTION", that means the *indexing* inclusion/exclusion rule discarded it. This is not the same as the *fetching* inclusion/exclusion rules, which is what it sounds like you might be setting. They're on the same tabs, just farther down. The manual does not include the indexing rules sections; this should be addressed.

I suspect that, based on the regexps you've given, you're also overlooking the fact that if the regexp matches ANYWHERE in the URL it is considered a match. So if you want a very specific URL, you need to delimit it with ^ at the beginning and $ at the end, to ensure that the entire URL matches and ONLY that URL.
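The anywhere-match behavior described in that comment can be sketched with plain java.util.regex. This is an illustration of the matching semantics only (the rule strings are hypothetical, and this is not ManifoldCF's actual matching code):

```java
import java.util.regex.Pattern;

public class UrlRuleDemo {
    // A rule matches if the regexp is found ANYWHERE in the URL,
    // i.e. Matcher.find() semantics rather than a whole-string match.
    static boolean matchesAnywhere(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String parent = "http://www.website.com/nl/";
        String child = "http://www.website.com/nl/some-child-page";

        // Unanchored exclusion: matches the child pages too
        System.out.println(matchesAnywhere("http://www\\.website\\.com/nl/", child));   // true

        // Anchored with ^ and $: matches only the parent page itself
        System.out.println(matchesAnywhere("^http://www\\.website\\.com/nl/$", child)); // false
        System.out.println(matchesAnywhere("^http://www\\.website\\.com/nl/$", parent)); // true
    }
}
```

So an exclusion of `^http://www\.website\.com/nl/$` would discard only the parent page while letting the child pages through.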
Re: Axis question
They're all checked in.

See
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1566/connectors/csws/wsdls

Karl

On Thu, Jan 24, 2019 at 6:24 AM Rafa Haro wrote:
> Karl, can you share the WSDL, I can try to take a look later today
Axis question
I'm redeveloping the Livelink connector because the API code has been discontinued and the only API is now web services based. The WSDLs and XSDs have been exported and I'm trying to use the Axis tool WSDL2Java to convert them to Java code. Unfortunately, I haven't been able to make this work -- even though the WSDL references have been made local and the XSDs also seem to be getting parsed, it complains about missing definitions, even though those definitions are clearly present in the XSD files.

Has anyone had enough experience with this tool, and web services in general, to figure out what's wrong? I've tried turning on as verbose a debugging level for WSDL2Java as I can and it's no help at all. I suspect namespace issues but I can't figure out what they are.

Thanks in advance,
Karl
Re: Do we support UTF-16 chars in version strings when using MySQL/MariaDB?
It's critical, with Manifold, that the database instance be capable of handling any characters it's likely to encounter. For Postgresql we tell people to install it with the utf-8 collation, for instance, and when we create database instances ourselves we try to specify that as well. For MariaDB, have a look at the database implementation we've got, and let me know if this is something we're missing anywhere?

Thanks,
Karl

On Wed, Jan 23, 2019 at 3:00 AM Markus Schuch wrote:
> Hi,
>
> while using MySQL/MariaDB for MCF I encountered a "deadlock" kind of
> situation caused by a supplementary UTF-16 character (e.g. U+1F3AE) in a
> string inserted into one of the varchar columns.
>
> In my case a connector wrote the title of a parent document into the
> version string of the processed document, which contained the character
> U+1F3AE - a gamepad :)
>
> This led to SQL error 22001 "Incorrect string value: '\xF0\x9F\x8E\xAE'
> for column 'lastversion' at row 1" in MySQL, because the utf8 collation
> encoding does not support that kind of character. (utf8mb4 does)
>
> The cause was hard to find, because it somehow led to a transaction
> abort loop in the incremental ingester and the error was not logged
> properly.
>
> My questions:
> - should we create the MySQL database with utf8mb4 by default?
> - or should inserted strings be sanitized of such characters?
> - or should 22001 be handled better?
>
> Thanks in advance
> Markus
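The failure mode Markus describes can be reproduced in a few lines of plain Java: U+1F3AE is a supplementary character, stored as a surrogate pair in Java's UTF-16 strings, and it needs four bytes in UTF-8 -- one more than MySQL's legacy utf8 (utf8mb3) encoding can store, which is exactly the `\xF0\x9F\x8E\xAE` seen in the 22001 error. A minimal demonstration:

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {
    public static void main(String[] args) {
        String gamepad = new String(Character.toChars(0x1F3AE)); // U+1F3AE GAME PAD

        // Java holds it as a UTF-16 surrogate pair: two chars, one code point
        System.out.println(gamepad.length());                            // 2
        System.out.println(gamepad.codePointCount(0, gamepad.length())); // 1

        // UTF-8 needs 4 bytes -- the F0 9F 8E AE from the MySQL error.
        // MySQL's legacy "utf8" tops out at 3 bytes per character; utf8mb4 allows 4.
        byte[] utf8 = gamepad.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                                 // 4
        System.out.printf("%02X %02X %02X %02X%n",
                utf8[0] & 0xFF, utf8[1] & 0xFF, utf8[2] & 0xFF, utf8[3] & 0xFF);
    }
}
```

Any code point above U+FFFF (emoji, many CJK extension characters) hits the same limit, which is why creating the database as utf8mb4 fixes the whole class of errors rather than this one character.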
[jira] [Resolved] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1563. - Resolution: Not A Problem

User has a configuration that makes no sense.

> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream
> must have > 0 bytes
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler" param, and I am getting an
> error on Solr, i.e. null:org.apache.solr.common.SolrException:
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance
[jira] [Resolved] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1535. - Resolution: Fixed > Documentum Connector cannot find dfc.properties > --- > > Key: CONNECTORS-1535 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1535 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11 > Environment: Manifold 2.11 > CentOS Linux release 7.5.1804 (Core) > OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode) > >Reporter: James Thomas >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > I have found that when installing a clean MCF instance I cannot get > Documentum repository connectors to connect to Documentum until I have added > this line to the processes/documentum-server/run.sh script before the call to > Java: > > {code:java} > CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code} > Until I do this, attempts to save the connector will result in this output to > the console: > > {noformat} > 4 [RMI TCP Connection(2)-127.0.0.1] ERROR > com.documentum.fc.common.impl.preferences.PreferencesManager - > [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null > java.io.FileNotFoundException: dfc.properties > at > com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.(PreferencesManager.java:37) > at > com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64) > ..{noformat} > and this message in the MCF UI: > > {noformat} > Connection failed: Documentum error: No DocBrokers are configured{noformat} > > > I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done > in that ticket. 
While setting up 2.11 from scratch I encountered it again. > > Once I have edited the run.sh script I get this in the console, showing that > (for whatever reason) the change is significant: > > {noformat} > Reading DFC configuration from > "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties" > {noformat} >
[jira] [Commented] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746262#comment-16746262 ] Karl Wright commented on CONNECTORS-1535: - [~jamesthomas], the registry process has no dependencies whatsoever on DFC, so any changes to this would be unnecessary. Last question: can the DFC properties location be provided as a -D switch parameter to the JVM?
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745422#comment-16745422 ] Karl Wright commented on CONNECTORS-1564: - [~michael-o] thanks for trying this. I await Erlend's more precise description of his setup. We are in fact setting up the HttpClientBuilder exactly as you recommend:

{code}
RequestConfig.Builder requestBuilder = RequestConfig.custom()
  .setCircularRedirectsAllowed(true)
  .setSocketTimeout(socketTimeout)
  .setExpectContinueEnabled(true)
  .setConnectTimeout(connectionTimeout)
  .setConnectionRequestTimeout(socketTimeout);

HttpClientBuilder clientBuilder = HttpClients.custom()
  .setConnectionManager(connectionManager)
  .disableAutomaticRetries()
  .setDefaultRequestConfig(requestBuilder.build())
  .setRedirectStrategy(new LaxRedirectStrategy())
  .setRequestExecutor(new HttpRequestExecutor(socketTimeout));
{code}

> Support preemptive authentication to Solr connector
>
>                 Key: CONNECTORS-1564
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>            Reporter: Erlend Garåsen
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1564.patch
>
> We should post preemptively in case the Solr server requires basic
> authentication. This will make the communication between ManifoldCF and Solr
> much more efficient, instead of the following:
> * Send an HTTP POST request to Solr
> * Solr sends a 401 response
> * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.
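Preemptive basic auth ultimately means attaching the Authorization header on the first request instead of waiting for the 401 challenge. The header value itself can be sketched with the plain JDK (the credentials below are placeholders; the HttpClient-side wiring via an AuthCache is not shown here):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PreemptiveAuthDemo {
    // Build the value of the "Authorization" header that preemptive basic
    // auth sends on the very first request, skipping the 401 round trip.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials, not from the ticket
        System.out.println(basicAuthHeader("user", "pass")); // Basic dXNlcjpwYXNz
    }
}
```

Sending this header up front halves the request count against a Solr instance protected by basic auth, which is the efficiency gain the ticket is after.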
[jira] [Commented] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745418#comment-16745418 ] Karl Wright commented on CONNECTORS-1535: - Can you put dfc.properties in the same directory as the other DFC files and have it be found?
[jira] [Updated] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1535: Fix Version/s: ManifoldCF 2.13 > Documentum Connector cannot find dfc.properties > --- > > Key: CONNECTORS-1535 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1535 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11 > Environment: Manifold 2.11 > CentOS Linux release 7.5.1804 (Core) > OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode) > >Reporter: James Thomas >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > I have found that when installing a clean MCF instance I cannot get > Documentum repository connectors to connect to Documentum until I have added > this line to the processes/documentum-server/run.sh script before the call to > Java: > > {code:java} > CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code} > Until I do this, attempts to save the connector will result in this output to > the console: > > {noformat} > 4 [RMI TCP Connection(2)-127.0.0.1] ERROR > com.documentum.fc.common.impl.preferences.PreferencesManager - > [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null > java.io.FileNotFoundException: dfc.properties > at > com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.(PreferencesManager.java:37) > at > com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64) > ..{noformat} > and this message in the MCF UI: > > {noformat} > Connection failed: Documentum error: No DocBrokers are configured{noformat} > > > I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done > in that 
ticket. While setting up 2.11 from scratch I encountered it again. > > Once I have edited the run.sh script I get this in the console, showing that > (for whatever reason) the change is significant: > > {noformat} > Reading DFC configuration from > "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties" > {noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
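The classpath fix above works because DFC looks up dfc.properties as a classpath resource, so the directory that contains it (here, $DOCUMENTUM) must appear on the classpath passed to the JVM. A minimal sketch of that mechanism, assuming Java 11+ (the temp directory stands in for $DOCUMENTUM; this is an illustration, not DFC's actual loading code):

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class DfcPropertiesLookup {
    public static void main(String[] args) throws IOException {
        // Create a stand-in for the $DOCUMENTUM directory holding dfc.properties.
        Path dir = Files.createTempDirectory("dfc");
        Files.writeString(dir.resolve("dfc.properties"), "dfc.docbroker.host[0]=example\n");

        // Without the directory on the classpath, the resource lookup fails ...
        ClassLoader bare = new URLClassLoader(new URL[0], null);
        System.out.println(bare.getResource("dfc.properties")); // null

        // ... with it on the classpath, the lookup succeeds -- which is what
        // CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM" achieves in run.sh.
        ClassLoader withDir = new URLClassLoader(new URL[]{dir.toUri().toURL()}, null);
        System.out.println(withDir.getResource("dfc.properties") != null); // true
    }
}
```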
[jira] [Assigned] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1535: --- Assignee: Karl Wright > Documentum Connector cannot find dfc.properties > --- > > Key: CONNECTORS-1535 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1535 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11 > Environment: Manifold 2.11 > CentOS Linux release 7.5.1804 (Core) > OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode) > >Reporter: James Thomas >Assignee: Karl Wright >Priority: Major > > I have found that when installing a clean MCF instance I cannot get > Documentum repository connectors to connect to Documentum until I have added > this line to the processes/documentum-server/run.sh script before the call to > Java: > > {code:java} > CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code} > Until I do this, attempts to save the connector will result in this output to > the console: > > {noformat} > 4 [RMI TCP Connection(2)-127.0.0.1] ERROR > com.documentum.fc.common.impl.preferences.PreferencesManager - > [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null > java.io.FileNotFoundException: dfc.properties > at > com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329) > at > com.documentum.fc.common.impl.preferences.PreferencesManager.(PreferencesManager.java:37) > at > com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64) > ..{noformat} > and this message in the MCF UI: > > {noformat} > Connection failed: Documentum error: No DocBrokers are configured{noformat} > > > I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done > in that ticket. 
While setting up 2.11 from scratch I encountered it again. > > Once I have edited the run.sh script I get this in the console, showing that > (for whatever reason) the change is significant: > > {noformat} > Reading DFC configuration from > "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties" > {noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: SharedDriveConnector - jcifs.smb.SmbException: A device attached to the system is not functioning
That error is coming from the server you're trying to index. Sounds like some kind of hardware problem is being detected. Karl On Thu, Jan 17, 2019 at 2:17 AM wrote: > > > Hi, > > I've got a problem with the WindowsShares-Connector. > While indexing data on a fileserver I get the following exception after > approximately 1 documents have been indexed. > > jcifs.smb.SmbException: A device attached to the system is not functioning. > at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563) ~ > [jcifs.jar:?] > at jcifs.smb.SmbTransport.send(SmbTransport.java:640) ~ > [jcifs.jar:?] > at jcifs.smb.SmbSession.send(SmbSession.java:238) ~[jcifs.jar:?] > at jcifs.smb.SmbTree.send(SmbTree.java:119) ~[jcifs.jar:?] > at jcifs.smb.SmbFile.send(SmbFile.java:775) ~[jcifs.jar:?] > at jcifs.smb.SmbFile.doFindFirstNext(SmbFile.java:1989) ~ > [jcifs.jar:?] > at jcifs.smb.SmbFile.doEnum(SmbFile.java:1741) ~[jcifs.jar:?] > at jcifs.smb.SmbFile.listFiles(SmbFile.java:1718) ~[jcifs.jar:?] > at jcifs.smb.SmbFile.listFiles(SmbFile.java:1707) ~[jcifs.jar:?] > at > > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileListFiles > (SharedDriveConnector.java:2318) [mcf-jcifs-connector.jar:?] > > at > > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments > (SharedDriveConnector.java:798) [mcf-jcifs-connector.jar:?] > > at org.apache.manifoldcf.crawler.system.WorkerThread.run > (WorkerThread.java:399) [mcf-pull-agent.jar:?] > ERROR 2019-01-16T08:48:00,172 (Worker thread '14') - JCIFS: SmbException > tossed processing smb://??.??.??.???/dir/dir/dir > > I'm using ManifoldCF 2.11 and jcifs-1.3.19.jar > > Do you have an idea what I could do or even a solution for it? > > Kind regards, > > Florjana > > > The exchange of e-mail messages is for purposes of information only and > only intended for the recipient. > This medium may not be used to exchange legal declarations. If you are not > the intended recipient, please contact us immediately by e-mail or phone > and delete this message from your system.
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743186#comment-16743186 ] Karl Wright commented on CONNECTORS-1564: - [~michael-o], any updates? > Support preemptive authentication to Solr connector > --- > > Key: CONNECTORS-1564 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1564 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Reporter: Erlend Garåsen > Assignee: Karl Wright >Priority: Major > Attachments: CONNECTORS-1564.patch > > > We should post preemptively in case the Solr server requires basic > authentication. This will make the communication between ManifoldCF and Solr > much more efficient than the following exchange: > * Send an HTTP POST request to Solr > * Solr sends a 401 response > * Send the same request, but with an "{{Authorization: Basic}}" header > With preemptive authentication, we can send the header in the first request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
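The round trip described above is avoided by computing the Basic credentials header eagerly and attaching it to the very first request. A minimal sketch using only the JDK; `basicAuthHeader` is a hypothetical helper name, not part of the connector:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PreemptiveBasicAuth {
    // Build the "Authorization: Basic <base64(user:password)>" value up front,
    // so the first POST already carries credentials and no 401 round trip occurs.
    static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // Attach to every outgoing request, e.g. with java.net.http:
        //   HttpRequest.newBuilder(uri)
        //       .header("Authorization", basicAuthHeader("solr", "secret"))
        System.out.println(basicAuthHeader("user", "pass")); // Basic dXNlcjpwYXNz
    }
}
```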
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743033#comment-16743033 ] Karl Wright commented on CONNECTORS-1563: - Please also see this discussion: https://issues.apache.org/jira/browse/CONNECTORS-1533 > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, managed-schema, manifold > settings.docx, manifoldcf.log, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743028#comment-16743028 ] Karl Wright commented on CONNECTORS-1563: - First, I asked for the Simple History, not the manifoldcf logs. What does the simple history say about document ingestions for the connection in question with the new configuration? But, from your solr log: {code} 2019-01-15 11:51:54.211 ERROR (qtp592617454-22) [ x:eesolr_webcrawler] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234) {code} Note that the stack trace is from the ExtractingDocumentLoader, which is Tika. You did not manage to actually change the output handler to the non-extracting one, possibly because you have configured your Solr in a non-default way. I cannot debug that for you, sorry. Can you do the following: Download the current 7.x version of Solr, fresh, and extract it. Start it using the standard provided simple scripts. Point ManifoldCF at it and crawl some documents, using the setup for the connection I have described. Does that work? If it does, and I expect it to because that is what works for me here, then it is your job to figure out what you did to Solr to make that not work. 
> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, managed-schema, manifold > settings.docx, manifoldcf.log, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743006#comment-16743006 ] Karl Wright commented on CONNECTORS-1563: - Please include [INFO] messages from the Solr log for example indexing requests, and also include records from the Simple History for documents indexed with the new configuration. Thanks. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, manifold settings.docx, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742949#comment-16742949 ] Karl Wright commented on CONNECTORS-1563: - Please view the Solr connection and click the button that tells it to forget about everything it has indexed. That will force reindexing. That's a standard step when you change configuration like this and you want all documents to be reindexed. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, manifold settings.docx, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1570) ManifoldCF Documentum connector crawling performance
[ https://issues.apache.org/jira/browse/CONNECTORS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741333#comment-16741333 ] Karl Wright commented on CONNECTORS-1570: - Please ask your question on the us...@manifoldcf.apache.org list. In our experience, the performance of Documentum itself is the bottleneck, and nothing can be done without optimizing for that. > ManifoldCF Documentum connector crawling performance > --- > > Key: CONNECTORS-1570 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1570 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.9.1 >Reporter: Gomahti >Priority: Major > > We are crawling data from a DCTM repository using the ManifoldCF Documentum > connector and writing the crawled data to MongoDB. Crawling was triggered with > a throttling value of 500, but crawling speed is very slow: the connector is > fetching only 170 documents per minute. The server where MCF is installed is configured > with enough memory and 8 logical cores (CPU). Can someone help us here to > improve crawling speed? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1570) ManifoldCF Documentum connector crawling performance
[ https://issues.apache.org/jira/browse/CONNECTORS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1570. - Resolution: Not A Problem > ManifoldCF Documentum connector crawling performance > --- > > Key: CONNECTORS-1570 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1570 > Project: ManifoldCF > Issue Type: Bug > Components: Documentum connector >Affects Versions: ManifoldCF 2.9.1 >Reporter: Gomahti >Priority: Major > > We are crawling data from a DCTM repository using the ManifoldCF Documentum > connector and writing the crawled data to MongoDB. Crawling was triggered with > a throttling value of 500, but crawling speed is very slow: the connector is > fetching only 170 documents per minute. The server where MCF is installed is configured > with enough memory and 8 logical cores (CPU). Can someone help us here to > improve crawling speed? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741004#comment-16741004 ] Karl Wright commented on CONNECTORS-1563: - {quote} I need to pass from manifold one custom field and value which I want to see in Solr index. That is the reason why I used metadata transformer where I can pass the custom field in job - tab metadata adjuster. {quote} Yes, people do that all the time. Just add the Metadata Adjuster any place in your pipeline and have it inject the field value you want. It will be faithfully transmitted to Solr. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740587#comment-16740587 ] Karl Wright commented on CONNECTORS-1563: - The metadata extractor can go anywhere in your pipeline, after Tika extraction. There is absolutely no point in having *two* Tika extractions though -- and that's what you're trying to do with the setup you've got. What I'd recommend is that you use only the ManifoldCF-side Tika extractor, and inject content into Solr using the /update handler, not the /update/extract handler. There's also a checkbox you'd need to uncheck in the Solr connection configuration. It's all covered in the ManifoldCF end user documentation. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740435#comment-16740435 ] Karl Wright commented on CONNECTORS-1563: - {quote} Solr cell with standard update handler... {quote} This is not Option 2; it's a combination of (1) and (2) and is not a model that we support. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740330#comment-16740330 ] Karl Wright commented on CONNECTORS-1563: - Can you tell me which configuration you are attempting: (1) Solr Cell + extract update handler + no Tika content extraction in MCF, or (2) NO Solr Cell + standard update handler + Tika content extraction in MCF Which is it? > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: managed-schema, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1569) IBM WebSEAL authentication
[ https://issues.apache.org/jira/browse/CONNECTORS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739541#comment-16739541 ] Karl Wright commented on CONNECTORS-1569: - I'm not sure what the best approach might be for this since almost everyone wants the expect-continue in place. It's essential, in fact, for authenticating properly via POST on many other systems. Adding a way of disabling this via the UI is plausible but it's significant work all around. Still, I think that would be the best approach to meet your needs. Unfortunately I'm already booked at least until March, so you may do best by trying to submit a patch that I can integrate and/or clean up. > IBM WebSEAL authentication > -- > > Key: CONNECTORS-1569 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1569 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifold 2.11 > IBM WebSEAL >Reporter: Ferdi Klomp >Assignee: Karl Wright >Priority: Major > Labels: ManifoldCF > > Hi, > We have stumbled upon a problem with the Web Connector authentication in > relation to IBM WebSEAL. We were unable to perform a successful > authentication against WebSEAL. After some time debugging we figured out the > web connector sends out an "Expect: 100-Continue" header and this is not > supported by WebSEAL. > [https://www-01.ibm.com/support/docview.wss?uid=swg21626421 > ]1. Disabling the "Expect: 100-Continue" functionality by setting > setExpectContinueEnabled to false in "ThrottledFetcher.java" eventually > solved the problem. The exact line can be found here: > > [https://github.com/apache/manifoldcf/blob/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/ThrottledFetcher.java#L508] > I'm not sure if this option is required for other environments, or whether it can > be disabled by default, or made configurable? > 2.
Another option would be to make the timeout configurable, as the WebSEAL > docs state "The browser need to have some kind of timeout to to send the > request body before exceeding intra-connection-timeout.". By default, the web > connector request timeout exceeded the intra-connection-timeout of WebSEAL. > What is the best way to proceed and get a fix for this in the web connector? > Kind regards, > Ferdi -- This message was sent by Atlassian JIRA (v7.6.3#76005)
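For reference, the flag the reporter changed lives on Apache HttpClient's per-request configuration. An untested fragment of what a configurable switch might look like; `disableExpectContinue` is a hypothetical connection parameter, not an existing connector setting, and this assumes HttpClient 4.x on the classpath:

```java
// Sketch only -- hypothetical 'disableExpectContinue' parameter.
import org.apache.http.client.config.RequestConfig;

RequestConfig requestConfig = RequestConfig.custom()
    // WebSEAL rejects "Expect: 100-Continue", so allow turning it off;
    // most other targets want it enabled, which stays the default.
    .setExpectContinueEnabled(!disableExpectContinue)
    .build();
```

Plumbing such a parameter through the Web connection configuration UI is the "significant work" mentioned above; the one-line RequestConfig change itself is small.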
[jira] [Assigned] (CONNECTORS-1569) IBM WebSEAL authentication
[ https://issues.apache.org/jira/browse/CONNECTORS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1569: --- Assignee: Karl Wright > IBM WebSEAL authentication > -- > > Key: CONNECTORS-1569 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1569 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifold 2.11 > IBM WebSEAL >Reporter: Ferdi Klomp >Assignee: Karl Wright >Priority: Major > Labels: ManifoldCF > > Hi, > We have stumbled upon a problem with the Web Connector authentication in > relation to IBM WebSEAL. We were unable to perform a successful > authentication against WebSEAL. After some time debugging we figured out the > web connector sends out an "Expect: 100-Continue" header and this is not > supported by WebSEAL. > [https://www-01.ibm.com/support/docview.wss?uid=swg21626421 > ]1. Disabling the "Expect: 100-Continue" functionality by setting > setExpectContinueEnabled to false in "ThrottledFetcher.java" eventually > solved the problem. The exact line can be found here: > > [https://github.com/apache/manifoldcf/blob/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/ThrottledFetcher.java#L508] > I'm not sure if this option is required for other environments, or whether it can > be disabled by default, or made configurable? > 2. Another option would be to make the timeout configurable, as the WebSEAL > docs state "The browser need to have some kind of timeout to to send the > request body before exceeding intra-connection-timeout.". By default, the web > connector request timeout exceeded the intra-connection-timeout of WebSEAL. > What is the best way to proceed and get a fix for this in the web connector? > Kind regards, > Ferdi -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738271#comment-16738271 ] Karl Wright commented on CONNECTORS-1562: - The "Stream has been closed" issue is occurring because it is simply taking too long to read all the data from the sitemap page, and the webserver is closing the connection before it's complete. Alternatively, it might be because the server is configured to cut pages off after a certain number of bytes. I don't know which one it is. You will need to do some research to figure out what your server's rules look like. The preferred solution would be to simply relax the rules for that one page. However, if that's not possible, the best alternative would be to break the sitemap page up into pieces. If each piece was, say, 1/4 the size, it might be small enough to get past your current rules. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web input connector > elastic output connector > Job crawls website input and outputs content to elastic > Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, > manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from the Elasticsearch index after rerunning the > changed seeds > I update my job to change the seed map and rerun it, or use the scheduler to > keep it running even after updating it. > After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)