[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826879#comment-16826879 ]

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

[~DonaldVdD] ManifoldCF keeps a queue of documents which it recrawls. The crawl is only complete when no documents remain in a state where they need to be fetched. For a continuous job, every document is immediately requeued once fetched, so this never happens.

As for session-based login: if you set up your login sequence properly, then when a document that needs a fresh cookie is fetched, the login will take place at that point and a new cookie will be used.

> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode we saw that not all
> documents are recrawled.
> The site is quite extensive. We figured out that after crawling, a
> document/page gets a recrawl timestamp between the recrawl interval and the
> max recrawl interval.
> But if these values fall within the first crawl, ManifoldCF starts recrawling
> those documents but seems to ignore the rest of the website. Also, some
> documents get recrawled 5 times while others don't get recrawled at all,
> apparently due to the same issue.
>
> Is it possible to shed a bit more light on continuous crawling?
> Is it a good system to use for crawling an (extensive) website?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826725#comment-16826725 ]

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

Hi [~DonaldVdD], MCF keeps crude statistics on how often the doc changes. As I said, it gets recrawled *eventually*; if it does not change, the time until the next crawl is doubled, up to the maximum the job is configured for.

As for when the job "stops": continuous crawl jobs do not stop. They run indefinitely until manually aborted.
[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826349#comment-16826349 ]

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

Continuous crawling bases the next crawl time on the last time the document changed. In general it doubles the crawling interval, up to the maximum, before retrying. So if your document doesn't change very often, the crawler may wait quite some time before revisiting it.

The best way to see what it is going to do is to find the document in the Document Status report and see when ManifoldCF intends to recrawl it.
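The interval-doubling behavior described above can be sketched in a few lines of plain Java. This is an illustration of the idea only, not ManifoldCF's actual scheduling code; the method names and the millisecond values are made up for the example.

```java
// Illustrative sketch of the adaptive recrawl interval described above:
// when a document is found unchanged, the interval until the next crawl
// doubles, capped at the job's configured maximum. Not ManifoldCF code;
// all names here are invented for the example.
public class RecrawlIntervalSketch {

    /** Returns the next recrawl interval in milliseconds. */
    static long nextInterval(long currentInterval, boolean documentChanged,
                             long minInterval, long maxInterval) {
        if (documentChanged) {
            // A change resets the schedule back to the minimum interval.
            return minInterval;
        }
        // Unchanged: back off by doubling, up to the configured maximum.
        return Math.min(currentInterval * 2, maxInterval);
    }

    public static void main(String[] args) {
        long min = 60_000L;        // 1 minute
        long max = 3_600_000L;     // 1 hour
        long interval = min;
        // An unchanging document backs off: 1m -> 2m -> 4m -> ... -> 1h cap.
        for (int i = 0; i < 10; i++) {
            interval = nextInterval(interval, false, min, max);
        }
        System.out.println(interval);                                // capped at max
        System.out.println(nextInterval(interval, true, min, max));  // reset on change
    }
}
```

This also explains the reporter's observation that some documents are fetched several times while others wait: documents that keep changing stay near the minimum interval, while unchanging ones drift toward the maximum.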
[jira] [Commented] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption
[ https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823915#comment-16823915 ]

Karl Wright commented on CONNECTORS-1600:
-----------------------------------------

How many times did you have it retry?

> Add support for configuring JCIFS connector's resilience to SMB exceptions
> before throwing a ServiceInterruption
> ---------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1600
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1600
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: JCIFS connector
>    Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11, ManifoldCF 2.12
>            Reporter: Tang Huan Song
>            Priority: Major
>
> This is an improvement request regarding the JCIFS(-ng) connector's exception
> handling behavior.
> After examining the JCIFS connector code, I've found that the number of
> retries given consecutive identical SMB exceptions and the total number of
> retries per file/request are hardcoded within the connector at
> retriesRemaining=3 and totalTries=5 respectively.
> Depending on the amount of traffic a file server regularly handles, the
> probability of any given SMB request failing, and correspondingly the total
> number of SMB request failures for a given file request, will vary. As a
> result, the current hardcoded values may cause ManifoldCF to abandon the job
> in the event of high traffic.
> I would like to suggest making these values configurable, either as a
> connector-wide setting modified via ManifoldCF's properties.xml or as a
> per-connection setting modified via the corresponding repository
> connection's page.
[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823800#comment-16823800 ]

Karl Wright commented on CONNECTORS-1498:
-----------------------------------------

Hi,
The following settings were suggested by Michael Allen, original developer of JCIFS:
{code}
jcifs.smb.client.soTimeout: 15
jcifs.smb.client.responseTimeout: 12
jcifs.resolveOrder: LMHOSTS,DNS,WINS
jcifs.smb.client.listCount: 20
jcifs.smb.client.dfs.strictView: true
{code}
The timeout values he chose were based on the behavior of the protocol, and were the maximum values he felt were reasonable given that. The resolve order was chosen based on his sense of how fast each of these is. The listCount is also based on protocol considerations, and strictView is required for proper functioning of the connector.

> Support SMBv2/v3 protocol for Windows Shares connector
> ------------------------------------------------------
>
>                 Key: CONNECTORS-1498
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1498
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: JCIFS connector
>         Environment: OS: CentOS 7.2
>                      ManifoldCF: 2.8.1
>            Reporter: Hiroaki Takasu
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.10
>
> The Windows Shares connector (JCIFS connector) uses the
> [JCIFS|https://jcifs.samba.org/] library, which supports only SMB protocol
> v1.
> But many file servers have SMBv1 disabled because of vulnerability
> [MS17-010|https://docs.microsoft.com/en-us/security-updates/SecurityBulletins/2017/ms17-010],
> so we cannot use the Windows Shares connector.
> I hope that ManifoldCF will support SMBv2/v3 via another CIFS library (e.g.
> [smbj|https://github.com/hierynomus/smbj])
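For orientation, here is a minimal sketch of how such settings might be assembled in code. With the original JCIFS these keys are usually set as JVM system properties; jcifs-ng can accept the same keys through its documented `jcifs.config.PropertyConfiguration`. Only the standard library is used below, and the jcifs-ng call is shown in a comment rather than executed; the values are copied verbatim from the thread above.

```java
import java.util.Properties;

// Sketch: assembling the recommended JCIFS settings as a Properties object.
// With jcifs-ng, the same keys can (per its documentation) be passed to
//   new jcifs.context.BaseContext(new jcifs.config.PropertyConfiguration(props))
// rather than being set globally as JVM system properties.
// (Call shown for orientation only; jcifs is not on the classpath here.)
public class JcifsSettingsSketch {

    static Properties recommendedSettings() {
        Properties props = new Properties();
        props.setProperty("jcifs.smb.client.soTimeout", "15");
        props.setProperty("jcifs.smb.client.responseTimeout", "12");
        props.setProperty("jcifs.resolveOrder", "LMHOSTS,DNS,WINS");
        props.setProperty("jcifs.smb.client.listCount", "20");
        // strictView is required for the connector to function properly.
        props.setProperty("jcifs.smb.client.dfs.strictView", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties p = recommendedSettings();
        p.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```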
[jira] [Resolved] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption
[ https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1600.
-------------------------------------
    Resolution: Won't Fix

Not a good idea in my opinion.
[jira] [Commented] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption
[ https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823794#comment-16823794 ]

Karl Wright commented on CONNECTORS-1600:
-----------------------------------------

This is, I would suggest, overkill. Not only is it a bad idea to make everything possible configurable in the UI, but the benefit of doing so is small to non-existent.

In 10 years of supporting this connector, what I've learned is that whenever people say they just need more retries, something else is very wrong, and the added retries do not help. If you ever run into a situation where 4 retries succeed and 3 don't, please let me know.
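As an illustration of the retry structure the issue describes (a budget of retriesRemaining=3 for consecutive identical failures inside an overall budget of totalTries=5), here is a plain-Java sketch. It is not the connector's actual code; in particular, resetting the consecutive budget when a *different* failure occurs is an assumption inferred from the issue's wording.

```java
import java.util.function.Supplier;

// Sketch of the retry structure described in the issue: up to totalTries
// attempts overall, bailing out early after retriesRemaining consecutive
// identical failures. Illustrative only -- not the JCIFS connector's code.
public class RetrySketch {

    static <T> T fetchWithRetries(Supplier<T> attempt) throws Exception {
        int retriesRemaining = 3;   // consecutive-identical-failure budget
        int totalTries = 5;         // overall attempt budget
        String lastMessage = null;
        Exception lastException = null;
        while (totalTries > 0 && retriesRemaining > 0) {
            totalTries--;
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                if (e.getMessage() != null && e.getMessage().equals(lastMessage)) {
                    // Same failure as last time: burn the consecutive budget.
                    retriesRemaining--;
                } else {
                    // Assumption: a different failure resets the budget.
                    retriesRemaining = 3;
                }
                lastMessage = e.getMessage();
                lastException = e;
            }
        }
        // In the connector, this is where a ServiceInterruption would be thrown.
        throw lastException;
    }

    public static void main(String[] args) throws Exception {
        // Succeeds on the third attempt, well within both budgets.
        int[] calls = {0};
        String result = fetchWithRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient");
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " tries");
    }
}
```

With this structure, a server that fails with the same SMB error every time gives up after four attempts, which is the kind of ceiling the reporter wanted to raise and Karl considered unnecessary.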
[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823745#comment-16823745 ]

Karl Wright commented on CONNECTORS-1498:
-----------------------------------------

Some of these settings should not be changed from a specific value. Other settings I could see changing on a per-connection basis. jcifs-ng does give the ability to control settings per connection, unlike the original JCIFS. But I think the only ones that should be connection-based are:
- jcifs.resolveOrder
- jcifs.smb.client.minVersion
- jcifs.smb.client.maxVersion
- jcifs.smb.client.ipcSigningEnforced

The first thing to do, though, is verify that the ported code works properly.
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821823#comment-16821823 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

Well, this is the method you are overriding, and its signature looks different from the one you propose:
{code}
GetListItemsResponseGetListItemsResult items = stub1.getListItems(docLibrary, "", q, viewFields, "1", buildNonPagingQueryOptions(), null);
{code}
docLibrary is the document library GUID, q is a GetListItemsQuery, viewFields is a GetListItemsViewFields which describes which fields to return, and the next argument is the listing options. I don't know what the last argument is.

> Add support for respecting the NoCrawl flag in Sharepoint
> ---------------------------------------------------------
>
>                 Key: CONNECTORS-1449
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1449
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: SharePoint connector
>            Reporter: Markus Schuch
>            Assignee: Markus Schuch
>            Priority: Major
>             Fix For: ManifoldCF next
>
> There is a flag {{NoCrawl}} in SharePoint that indicates whether an object
> should be crawled or not:
> Lists
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.splist.nocrawl.aspx
> Web
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spweb.nocrawl.aspx
> Field
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spfield.nocrawl.aspx
> Wouldn't it be nice to respect that flag?
[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821570#comment-16821570 ]

Karl Wright commented on CONNECTORS-1498:
-----------------------------------------

Ok, I got past the build problems and made what I think are the necessary changes. Please check out https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1498 and try it. Thanks!
[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821562#comment-16821562 ]

Karl Wright commented on CONNECTORS-1498:
-----------------------------------------

[~hhoechtl], I just tried this. The compilation errors are the following:
{code}
compile-connector:
    [javac] C:\wip\mcf\trunk\dist\connector-build.xml:594: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 6 source files to C:\wip\mcf\trunk\connectors\jcifs\build\connector\classes
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:20: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
    [javac] import jcifs.smb.NtlmPasswordAuthentication;
    [javac]                 ^
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:35: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
    [javac] import jcifs.smb.NtlmPasswordAuthentication;
    [javac]                 ^
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:19: error: cannot find symbol
    [javac] import jcifs.smb.ACE;
    [javac]                 ^
    [javac]   symbol:   class ACE
    [javac]   location: package jcifs.smb
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:20: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
    [javac] import jcifs.smb.NtlmPasswordAuthentication;
    [javac]                 ^
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:1246: error: cannot find symbol
    [javac]   protected void convertACEs(List allowList, List denyList, ACE[] aces)
    [javac]                                                             ^
    [javac]   symbol:   class ACE
    [javac]   location: class SharedDriveConnector
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:2404: error: cannot find symbol
    [javac]   protected static ACE[] getFileSecurity(SmbFile file, boolean useSIDs)
    [javac]                    ^
    [javac]   symbol:   class ACE
    [javac]   location: class SharedDriveConnector
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:2442: error: cannot find symbol
    [javac]   protected static ACE[] getFileShareSecurity(SmbFile file, boolean useSIDs)
    [javac]                    ^
    [javac]   symbol:   class ACE
    [javac]   location: class SharedDriveConnector
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:34: error: cannot find symbol
    [javac] import jcifs.smb.ACE;
    [javac]                 ^
    [javac]   symbol:   class ACE
    [javac]   location: package jcifs.smb
    [javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:35: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
    [javac] import jcifs.smb.NtlmPasswordAuthentication;
    [javac]                 ^
    [javac] 5 errors
    [javac] 4 warnings
{code}
The thing that stops it from compiling is that there is no longer a class jcifs.smb.ACE. This is an access-control-list element, and it is obviously critical to the functioning of ManifoldCF. Can you research what happened to this in jcifs-ng? Did they just rename it, or did they remove it completely?
[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821122#comment-16821122 ]

Karl Wright commented on CONNECTORS-1498:
-----------------------------------------

[~hhoechtl], this looks like a reimplementation of JCIFS which, in theory, should support the same original JCIFS API. It is LGPL, which is consistent with the original JCIFS, so I believe it includes much of the original code. It might well work by just downloading the jar and treating it like the original JCIFS. Have you tried this?
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817880#comment-16817880 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

Yes, we could use the ListItems method to retrieve the NoCrawl flag for individual listed documents, provided this performs well. If individual metadata requests would need to be made for each listed document, then we'd be better off adding a new method that wraps the metadata retrieval method we're currently using from Lists and adds the NoCrawl attribute to the response. Does this make sense?
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817462#comment-16817462 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

We cannot, and should not, change Lists.aspx -- it's a Microsoft service. MCPermissions, though, wraps one of the methods in Lists.aspx and presents it as an MCPermissions method with fewer restrictions. Please look at the code for details.
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816578#comment-16816578 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

Hi, the method that is used to get the SOAP metadata for a document is the following:
{code}
metadataValues = proxy.getFieldValues( sortedMetadataFields, encodePath(sitePath), listID, "/Lists/" + decodedItemPath.substring(cutoff+1), dspStsWorks );
{code}
This calls:
{code}
{
  // SharePoint 2010: Get field values some other way
  // SharePoint 2010; use Lists service instead
  ListsWS lservice = new ListsWS(baseUrl + site, userName, password, configuration, httpClient);
  ListsSoapStub stub1 = (ListsSoapStub)lservice.getListsSoapHandler();

  String sitePlusDocId = serverLocation + site + docId;
  if (sitePlusDocId.startsWith("/"))
    sitePlusDocId = sitePlusDocId.substring(1);

  GetListItemsQuery q = buildMatchQuery("FileRef","Text",sitePlusDocId);
  GetListItemsViewFields viewFields = buildViewFields(fieldNames);

  GetListItemsResponseGetListItemsResult items = stub1.getListItems(docLibrary, "", q, viewFields, "1", buildNonPagingQueryOptions(), null);
  if (items == null)
    return result;

  MessageElement[] list = items.get_any();
  final String xmlResponse = list[0].toString();
  if (Logging.connectors.isDebugEnabled()){
    Logging.connectors.debug("SharePoint: getListItems FileRef value '"+sitePlusDocId+"', xml response: '" + xmlResponse + "'");
  }
{code}
So it is calling the Lists service to do this right now (SharePoint 2010 and higher). For SharePoint 2003 it used the dspsts service, but that's been broken for a while, and I see no need to support this feature for that version of SharePoint.

If you introduce a new service or method, I will also need a configuration switch that enables the code that calls it, or backwards compatibility will not be maintained.
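If the NoCrawl flag were surfaced in the Lists response as discussed, the connector-side check could look roughly like the sketch below. Both the `ows_NoCrawl` attribute name and the sample row XML are hypothetical; the real field name, if any, would depend on the service or plugin change chosen above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Sketch of a connector-side NoCrawl check, assuming the flag were added to
// the Lists response as a row attribute. The attribute name "ows_NoCrawl"
// and the sample XML are hypothetical illustrations, not the real schema.
public class NoCrawlCheckSketch {

    /** Returns true if the row says the document should be skipped. */
    static boolean shouldSkip(String rowXml) {
        try {
            Element row = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            // SharePoint list fields conventionally arrive as ows_* attributes;
            // treat "1"/"True" as the flag being set.
            String noCrawl = row.getAttribute("ows_NoCrawl");
            return noCrawl.equals("1") || noCrawl.equalsIgnoreCase("True");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String flagged = "<row ows_FileRef='/Lists/Docs/a.docx' ows_NoCrawl='1'/>";
        String normal  = "<row ows_FileRef='/Lists/Docs/b.docx'/>";
        System.out.println(shouldSkip(flagged)); // true
        System.out.println(shouldSkip(normal));  // false
    }
}
```

A check at this point is what lets the repository connector tell the framework to delete the document, which, as noted above, a transformation connector cannot do.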
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816164#comment-16816164 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

The MCPermissions plugin at present furnishes two services: one to get permissions for users, and the other to list documents without the restrictions imposed by SharePoint. If the dspsts, webs, or versions services do not handle this themselves, I would propose that we either add a new MCPermissions service that wraps whatever is currently used to obtain document metadata and also adds the "NoCrawl" flag to the result, OR we put it in the existing Lists service wrapper we currently have.

Note that the problem isn't going to be adequately addressed unless we can somehow get this information on a per-document basis. We need to be able to tell the framework to delete the document when the connector looks at it. Doing this in a transformation connector won't work, for that very same reason: the document won't be sent to the transformer unless it's noticed to have changed in some way. So the repository connector really has to handle this.
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816017#comment-16816017 ]

Karl Wright commented on CONNECTORS-1449:
-----------------------------------------

It depends on why you want to avoid crawling something. If it's to prevent fetching it, then you can't do it at the transformer level. The right solution is to look for the flag in the SOAP response.

There is another solution, which is to modify the ManifoldCF SharePoint plugin for SharePoint 2013 to return the flag from the modified Lists service. That would involve C# code changes, but would definitely give us access to the flag in the connector. The code is checked in under https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2013/trunk . Have a look.
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815765#comment-16815765 ] Karl Wright commented on CONNECTORS-1449: - [~durai-jira] Rewriting the SharePoint connector to use the REST API is well beyond the scope of a bug fix of this kind. If you want to attempt this work and contribute it to ManifoldCF, I can certainly *help*, but I cannot do something of this scale myself, while working full time on other things, without even a SharePoint instance to work with.
[jira] [Resolved] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"
[ https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1599. - Resolution: Not A Problem Assignee: Karl Wright Fix Version/s: ManifoldCF 2.13 Working as designed. > response code 401 still gets deleted with the setting "keep unreachable > documents" > -- > > Key: CONNECTORS-1599 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1599 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > Even with the "Hop count mode" set to "keep unreachable documents, 'for now' > || forever" manifold deletes documents for which it receives a 401 response > code. > The documentation does not specify such a distinction as described above. Is > there some information/configuration that I'm missing? Is there a reasoning > behind the guaranteed deletion of a 401? > Ideally, for our use-case, we would want to remove all documents that return > 404, but keep everything which is due the server not responding or the > crawler being unauthenticated. > Is there a way to configure this in a more granular fashion? > Regards, > roel -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"
[ https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815389#comment-16815389 ] Karl Wright commented on CONNECTORS-1599: - [~goovaertsr], 401 means the document is not accessible. This has nothing to do with being "unreachable", because "unreachable" means there is no path to it from the seeds.
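The distinction drawn here, "inaccessible" (the fetch returns 401) versus "unreachable" (no path from the seeds, the case hop-count filtering governs), can be modeled roughly like this. The function and its arguments are hypothetical, for illustration only; they are not how the Web connector is implemented:

```python
# Illustrative only: "keep unreachable documents" affects only documents with
# no path from the seeds. A 401 on a reachable document means "inaccessible",
# and such documents are deleted regardless of the hop-count mode.

def disposition(status_code, reachable_from_seeds, keep_unreachable):
    if not reachable_from_seeds:
        # Only here does the hop-count mode apply.
        return "keep" if keep_unreachable else "delete"
    if status_code == 200:
        return "index"
    # 401 (and other failures) on a reachable document: delete.
    return "delete"
```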
[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint
[ https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815388#comment-16815388 ] Karl Wright commented on CONNECTORS-1449: - [~durai-jira], I do not have a sharepoint instance here. Nor do I have any indication that any of the web services included with SharePoint provide the value for this flag. If you can show me how the standard services return the flag value, I can easily implement this. Please let me know.
[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814326#comment-16814326 ] Karl Wright commented on CONNECTORS-1592: - [~goovaertsr] Yes, if you have no intention of doing hopcount filtering ever, then disable hop count filtering forever. It's far easier on the database. Having said that, I'm pretty sure you have other problems too. > Found long running query in manifold scheduled job > -- > > Key: CONNECTORS-1592 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1592 > Project: ManifoldCF > Issue Type: Bug >Affects Versions: ManifoldCF 2.12 >Reporter: Subasini Rath >Priority: Major > Attachments: LongRunningWithPlan_thread39.txt, > SELECT_blocked_queries.txt, postgresql.conf, properties.xml > > > Hi Karl, > I am also facing the above mentioned issue. (Similar to Connector-880) > I am using manifold2.12 binary version. I am using Solr output connector and > Web repository connection. Manifold is using all default configuration. > When I am running the jobs manually, it runs fine. Same jobs have been > scheduled to run everyday. > I am getting below exceptions and the job gets hanged/ going to waiting stage. > Could you please help me in resolving the same. > I am getting the below error - > Scenario-1 > WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query > (2706114 ms): [SELECT > t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs > t0 ORDER BY description ASC] > WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a > long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? 
LIMIT 1] > WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query > (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)] > WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: > 'e' > WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running > query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE > needpriority=? LIMIT 1000] > WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T' > WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I' > WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i' > WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062' > WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a > long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1] > WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter > 0: 'S' > WARN 2019-03-08T23:58:20,474 (Finisher thread) - Found a long-running query > (2752034 ms): [SELECT id FROM jobs WHERE status IN (?,?,?) FOR UPDATE] > WARN 2019-03-08T23:58:20,474 (Finisher thread) - Parameter 0: 'A' > WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 1: 'W' > WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 2: 'R' > WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Found a long-running > query (2752036 ms): [SELECT id FROM jobs WHERE status=? 
FOR UPDATE] > WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Parameter 0: 'E' > WARN 2019-03-08T23:58:20,483 (qtp550147359-4339) - Found a long-running > query (2496641 ms): [SELECT > t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs > t0 ORDER BY description ASC] > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: > isDistinctSelect=[false] > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isGrouped=[false] > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isAggregated=[false] > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: columns=[ COLUMN: > PUBLIC.JOBS.ID not nullable > WARN 2019-03-08T23:58:20,492 (qtp550147359-4346) - Found a long-running > query (2435908 ms): [SELECT > t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs > t0 ORDER BY description ASC] > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: > WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: ] > WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: [range variable 1 > WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: join type=INNER > WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: table=SYSTEM_SUBQUERY > WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: cardinality=0 > WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: access=FULL SCAN > WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join condition = > [index=SYS_IDX_13329 > WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ] > WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ][range variable 2 > WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join
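When triaging a log full of these warnings, it helps to pull out just the reported durations: near-identical huge values across many threads point to one extended lock-out of the whole database, while a wide spread points to individually slow queries. A small hypothetical helper, assuming the warning format shown above:

```python
import re

# Illustrative log-triage helper (not part of ManifoldCF): extract the
# millisecond durations from "Found a long-running query (NNN ms)" warnings.

PATTERN = re.compile(r"Found a long-running query \((\d+) ms\)")

def blocked_durations(log_lines):
    """Return all reported durations, sorted ascending."""
    return sorted(int(m.group(1))
                  for line in log_lines
                  for m in PATTERN.finditer(line))

# Two warning lines shaped like the ones in this issue:
lines = [
    "WARN ... - Found a long-running query (2706114 ms): [SELECT ...]",
    "WARN ... - Found a long-running query (2737370 ms): [SELECT id FROM jobs ...]",
]
```

If the minimum and maximum of the result are within a few percent of each other, as in this issue's log, everything was blocked at once rather than slow on its own.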
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814310#comment-16814310 ] Karl Wright commented on CONNECTORS-1593: - Hi [~DonaldVdD], what connector is being used to download the files? What is serving them? Having the data get corrupted is very very odd; I can't imagine having code that does that accidentally. > Memory issue on > org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) > --- > > Key: CONNECTORS-1593 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1593 > Project: ManifoldCF > Issue Type: Bug > Components: Tika extractor >Affects Versions: ManifoldCF 2.12 >Reporter: Donald Van den Driessche >Assignee: Karl Wright >Priority: Major > Attachments: image-2019-03-22-08-57-53-887.png > > > I have created an Issue with fontbox too: > > When using the internal Tika extractor in a Manifold Job on certain occasions > I get an Out of Memory Error. > {code:java} > Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of > memory - shutting down > Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: > Java heap space > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > 
org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:199) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at >
[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814306#comment-16814306 ] Karl Wright commented on CONNECTORS-1592: - {quote} the largest was 223673ms, the minimum time spent was 172416ms, the others are distributed between these extrema {quote} I saw a longer-running query than that in the log you posted, some 200ms. But the plan was fine. Once again, locking would have been the only explanation. But if you are seeing no queries running in less than 172416ms, then I think you may well have found your problem. The lion's share of Postgresql queries should be executing in well under a second. Times around 20ms would be typical. Something is very wrong with your Postgresql configuration or installation given that. {quote} Just one more question, considering what you said of the hopcount filtering; In the "Hop Filters"-tab we have nothing of configuration except for "hop count mode" is set to "delete unreachable", which i had interpreted as being the default. Is this correct that it is the default, and is there something else we could do to disable hop count filtering? {quote} That is the default; it's also the most inefficient. From the manual: {quote} On this same tab, you can tell the Framework what to do should there be changes in the distance from the root to a document. The choice "Delete unreachable documents" requires the Framework to recalculate the distance to every potentially affected document whenever a change takes place. This may require expensive bookkeeping, however, so you also have the option of ignoring such changes. There are two varieties of this latter option - you can ignore the changes for now, with the option of turning back on the aggressive bookkeeping at a later time, or you can decide not to ever allow changes to propagate, in which case the Framework will discard the necessary bookkeeping information permanently. This last option is the most efficient. 
{quote}
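The "expensive bookkeeping" behind "Delete unreachable documents" amounts, conceptually, to recomputing minimal hop distances from the seeds whenever the link graph changes, and deleting anything the recomputation can no longer reach. A simplified sketch of that computation (illustrative only; the real bookkeeping lives in the database's intrinsiclink table and is far more incremental):

```python
from collections import deque

# Breadth-first search over the link graph: documents absent from the result
# have no path from the seeds, i.e. they are "unreachable" in hop-count terms.

def hop_distances(seeds, links):
    """links maps a URL to the URLs it points at; returns minimal hop counts."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in dist:
                dist[target] = dist[url] + 1
                queue.append(target)
    return dist
```

Any change to `links` can shift distances across the whole graph, which is why the "ignore changes, for now or forever" modes are so much cheaper on the database.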
[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813400#comment-16813400 ] Karl Wright commented on CONNECTORS-1592: - {quote} one 'root' query doesn't get committed, this keeps a lock on the job-, intrinsiclink- or jobQueue-table and cascades into the bulk of locked queries. Main question here is how one query could get stuck; can a query be waiting for something from manifold until it is committed? {quote} That is certainly possible, but you should see that query logged as a very-long-running query in that case. What is the longest-running query you see logged? {quote} there is a locking conflict that arises from the jobID being a foreign key constraint for both the jobQueue and intrinsiclinks. From debugging we have the impression that postgres locks the whole intrinsiclink-table in a query which is specified to have one specific jobId. {quote} It may do that, but the way Postgresql works is that a SQL exception is then thrown, and the ManifoldCF code will retry the query. So this situation cannot cause the symptom you are seeing. The *only* way you can get into this situation is to have one particular query, which hits tables that all the other queries depend on, take a very long time. And that should show in the log. If it doesn't show in the log, that means that the locks are being caused externally, which is why I pointed at VACUUM FULL as being a potential cause. {quote} could using the multi process-functionality of org.apache.manifoldcf.usejettyparentclassloader be used to improve this issue? {quote} No, won't help. {quote} I have read that disabling swap can be good for intensive db-interactions; do you have experience with disabling swap improving manifold? {quote} Once again, probably immaterial, EXCEPT if your postgresql instances are swapping. That would be bad. {quote} is there a possibility that we could set-up a conference call with someone from the manifold team? 
{quote} I work full time on an entirely unrelated task and probably there is nobody else who would be in a position to go deep on this issue. So this is unlikely. One thing I notice, though, is that you are seeing a lot of intrinsiclink activity. If you are not using hopcount filtering, you can disable that entirely at the job level. It might help you (can't be sure until the blocking culprit is found though).
[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809709#comment-16809709 ] Karl Wright commented on CONNECTORS-1592: - [~goovaertsr], this is a perfectly fine plan, as you see in the execution estimates here: {code} WARN 2019-04-03T14:09:04,328 (Worker thread '39') - Plan: Planning time: 0.706 ms WARN 2019-04-03T14:09:04,328 (Worker thread '39') - Plan: Execution time: 0.382 ms {code} And yet the time (in this case) is 2 seconds for execution, which is still not bad actually, given that MCF is pounding on the database. As I said before, there is no indication of actual bad plans. Instead, the database as a whole is going offline or is being locked down for an extended period of time.
[jira] [Commented] (CONNECTORS-1598) session based authentication cannot register 401
[ https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808808#comment-16808808 ] Karl Wright commented on CONNECTORS-1598: - Hi [~goovaertsr], if your site authentication involves cookies, you really need to have session-based authentication set up. Furthermore, you do NOT want to include the login page in your seed list. Instead, you want to set up a login sequence (which is attached to a specific set of URLs that define the session-protected part of your site), which will be triggered if the login needs to be done. What session-based Auth does is the following: - detects when accessing a content page fails because of missing session login - provides a way of walking through the session login page sequence that sets the cookies - retries the content page fetch with the correct cookies that have been logged in It is therefore critical to configure the session-based access so that it properly detects when an invalid, missing, or expired session cookie is detected by the site you are crawling. If you've already read the end-user documentation about this, then this should be clear. If I understand your problem, your site does not redirect to a login page when there is no session cookie: it simply returns a 401. That's not a very typical flow for session-based code, but you should nevertheless be able to match specific page contents associated with the 401 response. In HTTP, all response codes can have content, and 401 is no different, so I assume this is the case? To answer your question: {quote} is 4xx tied to page base authentication and 3xx tied to session based authentication? {quote} 4xx responses are handled only as page-based authentication. That is the meaning of 4xx responses typically in HTTP. 
> session based authentication cannot register 401 > > > Key: CONNECTORS-1598 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1598 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Major > > Description: > Access to a specific domain is restricted by being A) an intranet service B) > based on an employee/customer profile. > For manifold to be able to be authenticated there is a specific > '\{{domain}}/login' page with a form where manifold was configured to enter > its username and password. A session-cookie is then set so manifold is > authenticated to access all resources. If a request for a resource is not > authenticated the service throws a 401. When the service returns a 401 the > actual content of the resource includes the same form as is present in > '\{{domain}}/login'. > Problem: > The only way we have been able to configure manifold to be authenticated was > by specifying session-based credentials AND providing '\{{domain}}/login' as > a seed in the job as well. The only other seed in the job is a sitemap. > This is of course not ideal since it can easily happen that the seed for the > sitemap gets processed first, which then throws a 401 on the sitemap and the > job stops. > Another possible scenario with this configuration is that the cookie expires > and all other resources throw 401 and get deleted from the index > (elasticsearch). There is also another job (different language, same domain), > usage of the cookie from the previous job has also been registered. 
> Current session-based access credentials configuration: > --url regular expression : https://\{{domain}}/ > --login pages: > ---login url regexp : 'login' > ---page type : form > ---identification regexp is set to match the form name > ---form parameters are filled with the correct parameters > This is verified to work, but as I understand it, this only works because the login page is part of the seeds, and so it matches the URL when it comes across it while crawling. There is no configuration yet which redirects (for example) to this page when Manifold receives a 401. > My goal was then to remove the login page from the seeds and configure the job so that each time a fetch returns a 401, Manifold knows to go to the login page. In pseudo code: > --If authenticated > ---process > --else > ---redirect to login > ---retry resource > > Based on the documentation here: > https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository > I tried a few different configurations. The first thing to notice is that in the comparison table, 'page based authentication' only mentions 4xx and 'session based authentication' only mentions 3xx. > At this time my biggest question is: are these response codes bound to the difference in settings between page and session based authentication? As far as I have been able
[jira] [Comment Edited] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808652#comment-16808652 ] Karl Wright edited comment on CONNECTORS-1592 at 4/3/19 12:08 PM: -- [~goovaertsr], when a large number of queries are blocked and do not get executed for a while (in this case 2706114 ms or so), then when all of them finally fire they are all reported as "slow running queries". The question is: why are all of these queries blocked? Tuple bloat just makes the database generally get slower and slower, so that is not it. If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, since tables get completely locked one at a time and are recreated. It is recommended that you either shut ManifoldCF down during this time, or create a "signalling file" which tells ManifoldCF to not do any real work until it goes away. Your choice. If you want to know more about the latter option, please let me know. If this isn't due to a concurrent "VACUUM FULL", then we're left with finding some other cause. While the blockage is taking place, there may be a way of getting PostgreSQL's state across all requests; that would be the ideal way to figure it out. FWIW, the following query should be instantaneous, which is why it appears to me that the whole database is locked down somehow: {code} WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query (2706114 ms): [SELECT t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs t0 ORDER BY description ASC] {code} was (Author: kwri...@metacarta.com): [~goovaertsr], when a large number of queries are blocked and do not get executed for a while (in this case 132000ms or so), then when all of them finally fire they are all reported as "slow running queries". The question is: why are all of these queries blocked? Tuple bloat just makes the database generally get slower and slower, so that is not it.
If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, since tables get completely locked one at a time and are recreated. It is recommended that you either shut ManifoldCF down during this time, or create a "signalling file" which tells ManifoldCF to not do any real work until it goes away. Your choice. If you want to know more about the latter option, please let me know. If this isn't due to a concurrent "VACUUM FULL", then we're left with finding some other cause. While the blockage is taking place, there may be a way of getting PostgreSQL's state across all requests; that would be the ideal way to figure it out.
> Found long running query in manifold scheduled job
> Key: CONNECTORS-1592
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1592
> Project: ManifoldCF
> Issue Type: Bug
> Affects Versions: ManifoldCF 2.12
> Reporter: Subasini Rath
> Priority: Major
> Hi Karl,
> I am also facing the above-mentioned issue. (Similar to CONNECTORS-880)
> I am using the ManifoldCF 2.12 binary version. I am using the Solr output connector and a Web repository connection. Manifold is using all default configuration.
> When I am running the jobs manually, they run fine. The same jobs have been scheduled to run every day.
> I am getting the below exceptions and the job hangs / goes into a waiting state.
> Could you please help me in resolving the same.
> I am getting the below error -
> Scenario-1
> WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query (2706114 ms): [SELECT t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs t0 ORDER BY description ASC]
> WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=?
LIMIT 1]
> WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
> WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: 'e'
> WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=? LIMIT 1000]
> WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T'
> WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
> WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i'
> WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062'
> WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
> WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter 0: 'S'
> WARN 2019-03-08T23:58:20,474 (Finisher thread)
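The "signalling file" option mentioned in the comment above can be sketched as a simple guard that worker threads check before doing real work, so a "VACUUM FULL" can run without ManifoldCF competing for table locks. This only illustrates the pattern; the file path and poll interval are assumptions, not ManifoldCF's actual configuration.

```python
import os
import time

# Hypothetical path; ManifoldCF's real signalling mechanism is configured
# differently -- this block only illustrates the pattern.
SIGNAL_FILE = "/var/run/mcf-maintenance.signal"

def wait_until_allowed(poll_seconds=60, exists=os.path.exists, sleep=time.sleep):
    """Block while the maintenance signal file exists (e.g. during VACUUM FULL).

    `exists` and `sleep` are injectable so the guard is easy to test.
    """
    while exists(SIGNAL_FILE):
        sleep(poll_seconds)
```

The operator would `touch` the file before maintenance and remove it afterward; worker loops call the guard before each unit of work.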
[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job
[ https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808652#comment-16808652 ] Karl Wright commented on CONNECTORS-1592: - [~goovaertsr], when a large number of queries are blocked and do not get executed for a while (in this case 132000ms or so), then when all of them finally fire they are all reported as "slow running queries". The question is: why are all of these queries blocked? Tuple bloat just makes the database generally get slower and slower, so that is not it. If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, since tables get completely locked one at a time and are recreated. It is recommended that you either shut ManifoldCF down during this time, or create a "signalling file" which tells ManifoldCF to not do any real work until it goes away. Your choice. If you want to know more about the latter option, please let me know. If this isn't due to a concurrent "VACUUM FULL", then we're left with finding some other cause. While the blockage is taking place, there may be a way of getting PostgreSQL's state across all requests; that would be the ideal way to figure it out.
[jira] [Assigned] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1595: --- Assignee: Kishore Kumar
> cross-site request forgery vulnerability
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
> Issue Type: Improvement
> Components: API
> Affects Versions: ManifoldCF 2.12
> Reporter: roel goovaerts
> Assignee: Kishore Kumar
> Priority: Minor
> Below is the full analysis and description resulting from the penetration test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF). A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing to the attacked application. The image source would actually be a URL with parameters pointing to the application page that performs some action. In the case of a form, the form action would point to the action page in the target application, and the form is submitted automatically by JavaScript when the page is viewed.
> 2. The attacker tricks the victim user into browsing to this page. The attacker may get the victim to click a link, or embed the attacking HTML code into some page the victim views, for example in a bulletin board or chat.
> 3. When the victim views the attacker's page, his browser sends a request prepared by the attacker to the attacked application. If the victim is logged in to the target application, his browser will possess all necessary session tokens, so the request will appear as authorized to the application and succeed.
> A cross-site request forgery attack uses the fact that the victim's browser possesses the necessary authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker that can trick an authenticated user into clicking a link crafted by the attacker or opening a malicious web page can force the victim to unknowingly perform various actions within the application.
> Given that the whole application is not protected against CSRF, any action that an administrator can take on Apache ManifoldCF could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
> * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. Consider the following example. An attacker tricks a logged-in application user into visiting a page containing the following code:
> {code:java}
> [HTML proof-of-concept elided: the archive stripped the markup. The page used history.pushState('', '', '/') and an auto-submitted multipart/form-data POST to https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp, pre-filling web connector and session-credential form parameters.]
> {code}
> When the victim's browser parses the page and tries to load images, it will cause them to execute any action of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with an unpredictable value to each form or link that performs some action in the application, commonly referred to as a CSRF token.
The parameter value should have enough entropy so that it cannot be predicted by an attacker, and it should be unique to the current user session. When the user submits the form or clicks the link, the server-side code checks the parameter value. If it is valid, the request is accepted; otherwise it is denied. The attacker has no way of knowing the value of the unpredictable parameter, so he cannot construct a form or link that will submit a valid request.
> *References*
> * OWASP - Cross-Site Request Forgery - [https://www.owasp.org/index.php/Cross-Site_Request_Forgery]
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
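The token scheme described in the recommendation above can be sketched as follows: a minimal Python illustration assuming a server-side session store, not ManifoldCF code (the function names are hypothetical).

```python
import hmac
import secrets

def issue_csrf_token(session):
    """Generate an unpredictable, session-bound token and store it server-side."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token  # embed this in each form as a hidden field

def check_csrf_token(session, submitted):
    """Validate the token on every state-changing request; reject on mismatch."""
    expected = session.get("csrf_token")
    if not expected or submitted is None:
        return False
    # Constant-time comparison avoids leaking the token via timing.
    return hmac.compare_digest(expected, submitted)
```

A CSRF page crafted by an attacker cannot know the per-session token, so its forged POST fails the check even though the victim's cookies are sent automatically.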
[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803842#comment-16803842 ] Karl Wright commented on CONNECTORS-1595: - [~goovaertsr] I am going to assign these to the fellow who wrote the current UI and see what he says. I expect some things would be easier to address than others.
[jira] [Assigned] (CONNECTORS-1597) reflected cross-site scripting vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1597: --- Assignee: Kishore Kumar (was: Karl Wright)
> reflected cross-site scripting vulnerability
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
> Issue Type: Improvement
> Components: API
> Affects Versions: ManifoldCF 2.12
> Reporter: roel goovaerts
> Assignee: Kishore Kumar
> Priority: Minor
> This is the full report of a penetration test, performed at a client where we deployed a system which uses Manifold:
> *Summary*
> A reflected cross-site scripting vulnerability was discovered in the application.
> Reflected cross-site scripting occurs when a web application displays data submitted by the user that contains HTML markup and scripting code without properly escaping it. An attacker will create a link to the vulnerable page that will display JavaScript code crafted by the attacker. The attacker will then trick an authenticated application user into clicking or following this crafted link. When the user's browser parses the generated page, it will execute the code crafted by the attacker. If the user was logged in to the application when he followed the link, the attacker's code could perform any action in the application that the user can perform.
> *Impact*
> Reflected cross-site scripting can be used by attackers to compromise the session of an authenticated user. By persuading the victim to click on a specially crafted link, the attacker can execute his own JavaScript payload in the browser context of the victim. In this specific case, an attacker could hijack the victim's session given that the session token is not flagged as HttpOnly, as demonstrated in [G190204T1F4][MANIFOLD] Insecure Cookie Configuration.
> Additional attacks exist where an attacker can deceive end users of the application by redirecting them to replica sites or tricking them into downloading trojans or other malware. The attacker can also use a so-called browser exploitation framework. In this scenario the attacker injects JavaScript code that communicates with the attack framework running on the attacker's computer. When the victim user executes the JavaScript code, the attacker can control the victim's browser. Publicly available frameworks exist (BeEF - [http://www.bindshell.net/tools/beef], Backframe - [http://www.gnucitizen.org/projects/backframe/], XSS Proxy - [http://xss-proxy.sourceforge.net/]).
> *Affected Systems*
> * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] [name of an arbitrarily supplied URL parameter]
> *Description*
> A case where the application includes user input into the generated HTML pages without properly escaping the user-supplied data was discovered in the application. The HTTP requests and responses shown below demonstrate the problem.
> {code:java}
> GET /mcf-crawler-ui/?smafi">alert(1)non7x=1 HTTP/1.1
> Host: els-manifold-uat.bc:8475
> Accept-Encoding: gzip, deflate
> Accept: */*
> Accept-Language: en
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)
> Connection: close
> Cookie: JSESSIONID=ov3qae9biucxdat0xiin5s18
> {code}
> {code:java}
> HTTP/1.1 200 OK
> Server: nginx/1.12.2
> Date: Mon, 18 Feb 2019 13:07:02 GMT
> Content-Type: text/html;charset=utf-8
> Content-Length: 2576
> Connection: close
> Pragma: No-cache
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> max-age: Thu, 01 Jan 1970 00:00:00 GMT
> [HTML body of the "Apache ManifoldCF Login" page elided: the archive stripped the markup. The response echoed the injected alert(1) parameter back into the login form without escaping it.]
> {code}
> *Recommendations*
> We recommend that the application enforces proper validation on user input. In most situations where user-controllable data is copied into application responses, cross-site scripting attacks can be prevented using two layers of defenses:
> * Input should be validated as strictly as possible on arrival, given the kind of content which it is expected to contain. For example, personal names should consist of alphabetical and a small range of typographical characters, and be relatively short; a year of birth should consist of exactly four numerals; email addresses should match a well-defined regular expression
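The two defensive layers in the recommendation above (strict input validation on arrival, plus escaping user-controlled data before echoing it) can be sketched like this. A minimal Python illustration, not ManifoldCF code; the name-field rule and function names are hypothetical.

```python
import html
import re

# Layer 1: validate input as strictly as its expected content allows.
# Illustrative rule for a short personal-name field: letters plus a small
# range of typographical characters, bounded length.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z '\-]{0,63}$")

def validate_name(value):
    return bool(NAME_RE.match(value))

def render_greeting(name):
    # Layer 2: HTML-escape anything user-controlled before echoing it back,
    # so injected markup like <script> is rendered inert.
    return "<p>Hello, " + html.escape(name, quote=True) + "</p>"
```

With both layers in place, a payload like the one in the request above is rejected on arrival, and even if it reaches the output stage it is emitted as inert text rather than executable markup.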
[jira] [Assigned] (CONNECTORS-1597) reflected cross-site scripting vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1597: --- Assignee: Karl Wright
[jira] [Commented] (CONNECTORS-1597) reflected cross-site scripting vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803835#comment-16803835 ] Karl Wright commented on CONNECTORS-1597: - Hi [~goovaertsr], I see that all of the security tickets you have opened have to do with usage of the ManifoldCF UI in an open web environment. Please understand that the UI was not designed for the kinds of security concerns one might have in such an environment. The team here is small, and UI design is not an area that has a deep bench. I would therefore urge you to include patches to address the concerns you have, in the best tradition of open-source software. Otherwise there is little chance they will be competently addressed. Thanks in advance. > reflected cross-site scripting vulnerability > > > Key: CONNECTORS-1597 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1597 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > This is the full report of a penetration test, performed at a client where we > deployed a system which uses manifold: > *Summary* > A reflected cross-site scripting vulnerability was discovered in the > application. > Reflected cross-site scripting occurs when a web application displays data > submitted by the user that > contains HTML markup and scripting code without properly escaping it. An > attacker will create a link to the > vulnerable page that will display JavaScript code crated by the attacker. The > attacker will then trick an > authenticated application user into clicking or following this crated link. > When the user's browser parses the > generated page, it will execute the code crafted by the attacker. If the user > was logged in to the application > when he followed the link, the attacker's code could perform any action in > the application that the user can > perform. 
> *Impact* > Reflected cross-site scripting can be used by attackers to compromise the > session of an authenticated user. > By persuading the victim to click on a specially crafted link, the attacker > can execute his own JavaScript > payload in the browser context of the victim. In this specific case, an > attacker could hijack its victim's session > given that the session token is not flagged as HttpOnly as demonstrated in > [G190204T1F4][MANIFOLD] > Insecure Cookie Configuration. > Additional attacks exist where an attacker can deceive end users of the > application by redirecting them to > replica sites or trick them into downloading trojans or other malware. The > attacker can also use a so called > browser exploitation framework. In this scenario the attacker injects > JavaScript code that communicates to > the attack framework running on the attacker's computer. When the victim user > executes the JavaScript code > the attacker can control the victim's browser. Publicly available frameworks > exist (BeEF - > [http://www.bindshell.net/tools/beef], Backframe > -[http://www.gnucitizen.org/projects/backframe/], XSS Proxy - > [http://xss-proxy.sourceforge.net/]). > *Affected Systems* > * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] [name of an arbitrarily > supplied URL parameter] > *Description* > A case where the application includes user input into the generated HTML > pages without properly escaping > the user supplied data was discovered in the application. The HTTP requests > and responses shown below > demonstrate the problem. 
> {code:java}
> GET /mcf-crawler-ui/?smafi"><script>alert(1)</script>non7x=1 HTTP/1.1
> Host: els-manifold-uat.bc:8475
> Accept-Encoding: gzip, deflate
> Accept: */*
> Accept-Language: en
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)
> Connection: close
> Cookie: JSESSIONID=ov3qae9biucxdat0xiin5s18
> {code}
> {code:java}
> HTTP/1.1 200 OK
> Server: nginx/1.12.2
> Date: Mon, 18 Feb 2019 13:07:02 GMT
> Content-Type: text/html;charset=utf-8
> Content-Length: 2576
> Connection: close
> Pragma: No-cache
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
>
> <!-- Login page markup (reconstructed outline; the tags were stripped by the
>      mailing-list archive). The supplied parameter is reflected unescaped: -->
> <title>Apache ManifoldCF™ Login</title>
> ...
> Sign in to start your session
> <form action="index.jsp?smafi"><script>alert(1)</script>non7x=1" method="POST">
> --snip--
> {code}
> *Recommendations*
> We recommend that the application enforce proper validation on user input. In most
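The fix the report's recommendations point toward is contextual output escaping. Below is a minimal sketch in Java of escaping the characters that let a parameter break out of HTML or attribute context; the class and method names are illustrative, not ManifoldCF's actual code.

```java
// Illustrative output-escaping helper; not ManifoldCF's actual implementation.
public class HtmlEscape {
    // Escape the five characters that allow breaking out of HTML or
    // attribute context, which is what the reflected payload above exploits.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied at the point where the URL parameter is written back into the page, the payload `"><script>alert(1)</script>` would be rendered inert text rather than markup.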
[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803802#comment-16803802 ] Karl Wright commented on CONNECTORS-1595: - [~goovaertsr]: For all of the security tickets you have submitted against MCF, we have no ability to address these ourselves; this is a small project and essentially you are attempting to make the MCF UI safe to operate in an open web environment. That was not its design point, either at the beginning or ever. We are always receptive to patches, so if you have specific code changes you want us to consider, please feel free to attach appropriate patches to the tickets you have created. Thank you. > cross-site request forgery vulnerability > > > Key: CONNECTORS-1595 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1595 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > Below is the full analysis and description as a result from the penetration > test. > *Summary* > The application is vulnerable to Cross-Site Request Forgery (CSRF). > A cross-site request forgery attack uses the following scenario: > 1. An attacker creates a web page that includes an image or a form pointing > to the attacked application. > The image source would actually be a URL with parameters pointing to the > application page that > performs some action. In case of a form, the form action would point to the > action page in the target > application, and the form is submitted automatically by JavaScript when the > page is viewed. > 2. The attacker tricks the victim user to browse to this page. The attacker > may get the victim to click a > link, or embed the attacking HTML code into some page the victim views, for > example in a bulletin > board or chat. > 3. When the victim views the attacker's page, his browser sends a request > prepared by the attacker to > the attacked application. 
If the victim is logged in to the target > application, his browser will possess > all necessary session tokens, so the request will appear as authorized to the > application and > succeed. > A cross-site request forgery attack uses the fact that the victim's browser > possesses the necessary > authentication tokens to perform some actions in the target application. > *Impact* > A remote, unauthenticated attacker that can trick an authenticated user into > clicking a link crafted by the > attacker or open a malicious web page, can force the victim to unknowingly > perform various actions within > the application. > Given that the whole application is not protected against CSRF, any action > that an administrator can take on > Apache Manifold could be unknowingly performed if they fall for a CSRF attack. > *Affected Systems* > * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] > *Description* > It appears that the application does not implement any CSRF protection. > Consider the following example. An > attacker tricks a logged in application user to visit a page containing the > following code:
> {code:java}
> <!-- Reconstructed outline; the original markup and most parameter values
>      were stripped by the mailing-list archive. -->
> <html>
>   <body onload="document.forms[0].submit()">
>     <script>history.pushState('', '', '/')</script>
>     <form action="https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp"
>           method="POST" enctype="multipart/form-data">
>       <!-- hidden inputs carrying the web crawler connector class name, seed
>            URLs, and session-login fields ("username", "password",
>            "loginformtype", "Continue", ...) -->
>       ...
>     </form>
>   </body>
> </html>
> {code}
> When the victim's browser parses the page and tries to load images, it will > cause them to execute any action > of the attacker's choosing on Manifold. 
> *Recommendations* > The usual approach to preventing CSRF attacks is to add a new parameter with > an unpredictable value to > each form or link that performs some action in the application, commonly > referred to as a CSRF-Token. The > parameter value should have enough entropy so that it cannot be predicted by > an attacker and should be > unique to the current user session. When the user submits the form or clicks > the link, the server side code > checks the parameter value. If it is valid, the request is accepted, > otherwise it is denied. The attacker has no > way of knowing the value of the unpredictable parameter, so he cannot > construct a form or link that will > submit a valid request. > *References* > * OWASP - Cross-Site
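The synchronizer-token scheme described in the recommendations above can be sketched as follows; the class and method names are illustrative, not ManifoldCF API. Note the constant-time comparison, so the validity check itself does not leak the token byte by byte.

```java
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

// Sketch of a per-session CSRF token, as the recommendation describes.
public class CsrfToken {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generate an unpredictable token: 256 bits of entropy, URL-safe encoded.
    // Stored in the server-side session and embedded in every state-changing form.
    public static String generate() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // On form submission, compare the submitted token against the session
    // copy in constant time; reject the request if they do not match.
    public static boolean validate(String sessionToken, String submittedToken) {
        if (sessionToken == null || submittedToken == null) return false;
        return MessageDigest.isEqual(sessionToken.getBytes(), submittedToken.getBytes());
    }
}
```

An attacker's page cannot read the token (same-origin policy), so a forged form submission fails validation.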
[jira] [Reopened] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reopened CONNECTORS-1595: - The complaint is that the ManifoldCF user interface has this issue. Once again, the MCF user interface is a back-office app and does not go against untrusted open network systems. > cross-site request forgery vulnerability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1595: Comment: was deleted (was: This is not applicable to MCF, since the domain scope of the pages fetched by it during a web crawl are explicitly laid out by configuration, and thus "redirection to a malicious page" is not something that can actually take place unless the person who sets up the crawling job does this by specific design. ) > cross-site request forgery vulnerability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1595. - Resolution: Not A Problem This is not applicable to MCF, since the domain scope of the pages fetched by it during a web crawl are explicitly laid out by configuration, and thus "redirection to a malicious page" is not something that can actually take place unless the person who sets up the crawling job does this by specific design. > cross-site request forgery vulnerability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1596) brute-force vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803005#comment-16803005 ] Karl Wright commented on CONNECTORS-1596: - The ManifoldCF UI is not expected to be used in an open web environment, but in a back-office environment. Security protections designed to prevent remote hackers from getting into the UI using sophisticated tools are therefore not expected. Similarly, there will be no attempt to implement dual-factor authentication for the MCF admin UI. > brute-force vulnerability > - > > Key: CONNECTORS-1596 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1596 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > As a result of a pen test, it appears there is no functionality to counter > brute-force attacks for logging in. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
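For deployments that nonetheless need this protection in front of the UI, the countermeasure the report asks for is a failure counter with a lockout window. A generic sketch follows; this is not part of ManifoldCF, and in practice a reverse proxy in front of the back-office UI can provide the same protection without code changes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic login-throttling sketch (illustrative, not ManifoldCF code):
// lock a username out for a fixed period after N consecutive failures.
public class LoginThrottle {
    private final int maxFailures;
    private final long lockoutMillis;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    private final Map<String, Long> lockedUntil = new ConcurrentHashMap<>();

    public LoginThrottle(int maxFailures, long lockoutMillis) {
        this.maxFailures = maxFailures;
        this.lockoutMillis = lockoutMillis;
    }

    // Reject login attempts while the lockout window is open.
    public boolean isLocked(String user, long now) {
        Long until = lockedUntil.get(user);
        return until != null && now < until;
    }

    // Count a failed attempt; start the lockout once the threshold is hit.
    public void recordFailure(String user, long now) {
        int count = failures.merge(user, 1, Integer::sum);
        if (count >= maxFailures) {
            lockedUntil.put(user, now + lockoutMillis);
            failures.remove(user);
        }
    }

    // A successful login resets the failure counter.
    public void recordSuccess(String user) {
        failures.remove(user);
    }
}
```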
[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802998#comment-16802998 ] Karl Wright commented on CONNECTORS-1595: - Please describe (1) what the attack looks like and (2) how this compromises MCF security. > cross-site request forgery vulnerability > > > Key: CONNECTORS-1595 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1595 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > It appears that manifoldcf does not implement any CSRF protection. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1594) insecure cookie configuration vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802996#comment-16802996 ] Karl Wright commented on CONNECTORS-1594: - The issue described will not in any way hijack what MCF indexes. The concern is that the session ID can be retrieved by a man-in-the-middle should you be crawling a Broadvision site that has both http and https pages. I would argue that that is in fact a site design issue, not a MCF security vulnerability. > insecure cookie configuration vulnerability > --- > > Key: CONNECTORS-1594 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1594 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > The application session cookie "JSESSIONID" does not have Secure and HTTPOnly > flags set. > The application uses an HTTP cookie as session identifier. The Set-Cookie > instruction sent by the application to the browser does not specifically > instruct the browser to only use the cookie on secure communication channels > (HTTPS). As the instruction is missing, browsers will fall back to their > default setting, generally meaning that the cookie will be used on both > secure and insecure communication channels. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
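What the report asks for is straightforward to express: the session cookie should carry the Secure flag (never sent over plain HTTP) and the HttpOnly flag (not readable from JavaScript). A sketch using the JDK's own `java.net.HttpCookie` to show the two flags; in a real deployment this would be done in the servlet container's session-cookie configuration rather than in application code.

```java
import java.net.HttpCookie;

// Illustrative only: build a session cookie with the two flags the
// pen-test report says are missing from JSESSIONID.
public class SecureSessionCookie {
    public static HttpCookie build(String sessionId) {
        HttpCookie cookie = new HttpCookie("JSESSIONID", sessionId);
        cookie.setSecure(true);   // only transmit over HTTPS
        cookie.setHttpOnly(true); // invisible to document.cookie
        cookie.setPath("/");
        return cookie;
    }
}
```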
[jira] [Commented] (CONNECTORS-1597) reflected cross-site scripting vulnerability
[ https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802984#comment-16802984 ] Karl Wright commented on CONNECTORS-1597: - Please give more details. Bear in mind that ManifoldCF does not execute any Javascript, so offhand I find this hard to believe. > reflected cross-site scripting vulnerability > > > Key: CONNECTORS-1597 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1597 > Project: ManifoldCF > Issue Type: Improvement > Components: API >Affects Versions: ManifoldCF 2.12 >Reporter: roel goovaerts >Priority: Minor > > As a result from a pen test, a reflected cross-site scripting vulnerability > was discovered -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798910#comment-16798910 ] Karl Wright commented on CONNECTORS-1593: - There is a philosophy about memory consumption that we rigorously adhere to in ManifoldCF, known as the "bounded memory consumption" philosophy: connectors must be written so that they are not sensitive to the size of the data they are indexing. Streams are used and the data never "hits memory" in its entirety. But if you aren't careful, your custom connector might well put entire documents into memory, and then all it takes is two large documents at the same time and you are hosed. Can you check your custom connector for that issue? If there is a problem there, you could work around it by limiting the number of custom connector connections to 1. If that works reliably, then you know where the issue is. > Memory issue on > org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) > --- > > Key: CONNECTORS-1593 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1593 > Project: ManifoldCF > Issue Type: Bug > Components: Tika extractor >Affects Versions: ManifoldCF 2.12 >Reporter: Donald Van den Driessche >Assignee: Karl Wright >Priority: Major > Attachments: image-2019-03-22-08-57-53-887.png > > > I have created an Issue with fontbox too: > > When using the internal Tika extractor in a Manifold Job on certain occasions > I get an Out of Memory Error. 
> {code:java} > Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of > memory - shutting down > Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: > Java heap space > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:199) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > 
org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) > Mar 16 14:20:06 manifold01 manifoldcf[15747]: at > org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) > Mar 16 14:20:06
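The "bounded memory consumption" rule described above amounts to moving document data through a fixed-size buffer instead of materializing whole documents. A minimal sketch of the pattern; illustrative, not ManifoldCF code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stream a document through a fixed-size buffer so that memory use is
// independent of document size -- the bounded-memory pattern connectors
// are expected to follow. Illustrative sketch, not ManifoldCF code.
public class StreamCopy {
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[65536]; // bounded, regardless of document size
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

A connector that instead calls something like `readAllBytes()` per document multiplies its memory footprint by the number of concurrent connections, which is exactly the failure mode described in the comment.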
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796057#comment-16796057 ] Karl Wright commented on CONNECTORS-1593: - So just to be sure we are in agreement: Your Tika extractor connection has a max number of connections set to just 2, and you are still seeing "out of memory" with 8GB?? The max connections for the repository connections and for the output connections can be set to anything that makes sense for the services they are connecting to. These numbers are all independent, but obviously there will be throttling that takes place as a result of not having sufficient connections available at all times in your pipeline. The other connections and worker thread count obviously all contribute to the maximum memory consumption as well -- if you have other connectors that are not written with memory limits in mind then you can easily run into problems of this kind, and the true culprit that is driving memory consumption might have nothing to do with the stack trace you see. May I ask what connectors are involved? Are any custom? > Memory issue on > org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796003#comment-16796003 ] Karl Wright commented on CONNECTORS-1593: - [~DonaldVdD], then from your description it sounds like the problem isn't with PDFBox. It's because you have more worker threads allowed than you have memory available. So if all of the worker threads wind up working on memory-expensive documents at the same time, they collide and you run out of memory. How many worker threads did you allot?
[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796003#comment-16796003 ] Karl Wright edited comment on CONNECTORS-1593 at 3/19/19 11:52 AM: --- [~DonaldVdD], then from your description it sounds like the problem isn't with PDFBox. It's because you have more worker threads allowed than you have memory available. So if all of the worker threads wind up working on memory-expensive documents at the same time, they collide and you run out of memory. How many worker threads did you allot? How many Tika transformer connections do you have specified as the max? The real number of simultaneous Tika extractions that take place will be the minimum of these two values.
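The "minimum of these two values" rule lends itself to a quick back-of-the-envelope check. A minimal sketch (class and method names are illustrative, not ManifoldCF APIs, and the per-document memory figure is an assumed worst case, not a measured one):

```java
// Sanity-check sketch for the bound described above: the number of
// simultaneous Tika extractions is capped by the smaller of the worker
// thread count and the Tika connection maximum, so peak parser heap is
// roughly that cap times the worst-case per-document footprint.
public class ExtractionParallelism {

    // Effective concurrent extractions: min(workerThreads, tikaMaxConnections).
    static int effectiveParallelism(int workerThreads, int tikaMaxConnections) {
        return Math.min(workerThreads, tikaMaxConnections);
    }

    // Rough peak heap demand (MB) for the parsing stage alone.
    static long peakParseHeapMb(int workerThreads, int tikaMaxConnections,
                                long worstCasePerDocMb) {
        return (long) effectiveParallelism(workerThreads, tikaMaxConnections)
                * worstCasePerDocMb;
    }

    public static void main(String[] args) {
        // 30 worker threads but only 2 Tika connections -> 2 concurrent extractions.
        System.out.println(effectiveParallelism(30, 2));
        // If a pathological PDF can need ~1500 MB to parse, 2 concurrent
        // extractions alone can demand ~3000 MB of heap.
        System.out.println(peakParseHeapMb(30, 2, 1500));
    }
}
```

If that product approaches the configured -Xmx, an OOM under unlucky scheduling is expected regardless of which parser happens to appear in the stack trace.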
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795983#comment-16795983 ] Karl Wright commented on CONNECTORS-1593: - The other possibility for figuring this out is to use the external Tika service, and then the MCF agents process won't be killed while crawling. Instead, errors will be logged for the specific documents that cause the issue.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795982#comment-16795982 ] Karl Wright commented on CONNECTORS-1593: - [~DonaldVdD], I think you will need to identify the document and make it available to them (if possible). That's not going to be easy I'm afraid but maybe with connector logging turned on it might be possible.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795947#comment-16795947 ] Karl Wright commented on CONNECTORS-1593: - No suggestions, unfortunately. Can you let me know what the PDFBox ticket is?
[jira] [Assigned] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1593: --- Assignee: Karl Wright
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update "last checked" time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795722#comment-16795722 ] Karl Wright commented on CONNECTORS-880: [~SubasiniR], your issue has nothing whatsoever to do with this ticket. It really belongs first on the user list. The issue is that your database is going offline for 2700 seconds (45 minutes) while your crawl is taking place. Queries that normally would be instantaneous are therefore just not being completed at all for that period of time. The plans look fine so that isn't it. If this is using HSQLDB (which is the default database for the single-process example), then you have probably exceeded its capacity. It stores all of its tables in memory. You will want to upgrade to a real database instead. I would prefer PostgreSQL over MySQL, because MySQL has had transactional integrity issues for a couple of versions now, and those would be fatal for ManifoldCF. By the way, "Illegal seed URL" is a warning and does not impact behavior other than to notify you that one of the seeds you are using in your crawl is not valid according to the W3C spec. That seed will not be used. > Under the right conditions, job aborts do not update "last checked" time > > > Key: CONNECTORS-880 > URL: https://issues.apache.org/jira/browse/CONNECTORS-880 > Project: ManifoldCF > Issue Type: Bug > Components: Framework crawler agent >Affects Versions: ManifoldCF 1.4.1 >Reporter: Karl Wright >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 1.6 > > > When a scheduled job is being considered to be started, MCF updates the > last-check field ONLY if the job didn't start. It relies on the job's > completion to set the last-check field in the case where the job does start. > But if the job aborts, in at least one case the last-check field is NOT > updated. This leads to the job being run over and over again within the > schedule window. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
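On the HSQLDB-to-PostgreSQL move suggested above: the switch is made in ManifoldCF's properties.xml. A hedged sketch of the relevant fragment follows; the property names are as documented for ManifoldCF's PostgreSQL setup (verify against your version's deployment docs), and the database name, user, and password values are placeholders.

```xml
<!-- properties.xml fragment: point ManifoldCF at PostgreSQL instead of the
     in-memory HSQLDB default. The name/username/password values below are
     placeholders; adjust them for your installation. -->
<property name="org.apache.manifoldcf.databaseimplementationclass"
          value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
<property name="org.apache.manifoldcf.database.name" value="dbname"/>
<property name="org.apache.manifoldcf.database.username" value="manifoldcf"/>
<property name="org.apache.manifoldcf.database.password" value="secret"/>
```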
[jira] [Updated] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1591: Fix Version/s: ManifoldCF 2.13 > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from an RTF > file, the result has no separator. See attachments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2838) RTF document processing glues comment fields together with text without whitespace
Karl Wright created TIKA-2838: - Summary: RTF document processing glues comment fields together with text without whitespace Key: TIKA-2838 URL: https://issues.apache.org/jira/browse/TIKA-2838 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.19, 1.17 Reporter: Karl Wright See ManifoldCF ticket CONNECTORS-1591 for a sample document and a description of the problem. Basically, comment fields for RTF documents are glued together with no whitespace between them, while other document formats properly put in a space (e.g. .docx etc). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1591: --- Assignee: Karl Wright > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Assignee: Karl Wright >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790343#comment-16790343 ] Karl Wright commented on CONNECTORS-1591: - Hi [~zfarago], I think the right approach here is to leave this ticket open and link to a TIKA ticket describing your problem. The issue is not really related to ManifoldCF itself, and we cannot solve it for you until the Tika team corrects the issue. I'll go ahead and create the linked ticket. > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790294#comment-16790294 ] Karl Wright edited comment on CONNECTORS-1591 at 3/12/19 7:18 AM: -- [~zfarago] Ok, we're getting closer. What version of ManifoldCF is this? And, are you using the ES mapper attachment? was (Author: kwri...@metacarta.com): [~zfarago] Ok, we're getting closer. What version of ManifoldCF is this? > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790294#comment-16790294 ] Karl Wright commented on CONNECTORS-1591: - [~zfarago] Ok, we're getting closer. What version of ManifoldCF is this? > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789712#comment-16789712 ] Karl Wright commented on CONNECTORS-1591: - [~zfarago] When you run a ManifoldCF job that fetches an RTF document and runs it through the Tika extractor, what comes out is a stream of characters (the content stream) plus various metadata fields. All of these are sent to the output connector, which then does whatever it wants with these. You *cannot* see the content stream nor the metadata directly. So I need to know where you are getting result.txt from. There is a missing step that you aren't telling me about and it's a critical one. > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789698#comment-16789698 ] Karl Wright commented on CONNECTORS-1591: - I will repeat the question. *Where* is result.txt coming from? Where are you finding it? Is it content or metadata? If metadata, what metadata field? > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.txt > > > We have a problem with Manifold/Tika. When a comment is parsed from and RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem
[ https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789564#comment-16789564 ] Karl Wright commented on CONNECTORS-1591: - Hi [~zfarago], the result.xml file you attached is certainly not xml. Was this intended? In its current form I have no idea what it is, what it's supposed to represent, or exactly where you got it. Please clarify that, and also clarify what you *expect* to see. Bear in mind that if you are looking at the actual content or metadata output of the Tika Extractor, it's no help to create a ticket against ManifoldCF for that. We do not develop Tika and there is nothing we could do other than open a Tika ticket. So I suggest that you do that instead. > RTF comment parsing problem > --- > > Key: CONNECTORS-1591 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1591 > Project: ManifoldCF > Issue Type: Bug >Reporter: Zoltan Farago >Priority: Major > Attachments: comment.rtf, result.xml > > > We have a problem with Manifold/Tika. When a comment is parsed from an RTF > file, the result has no separator. see attachments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1590) Resources should be closed in a finally block
[ https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1590: --- Assignee: Karl Wright > Resources should be closed in a finally block > - > > Key: CONNECTORS-1590 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1590 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public class DBInterfaceHSQLDB extends Database implements IDBInterface { > ... > try > { > Connection c = > DriverManager.getConnection(_localUrl+databaseName,userName,password); > Statement s = c.createStatement(); > s.execute("SHUTDOWN"); > c.close(); > } > catch (Exception e) > { > // Never any exception! > e.printStackTrace(); > } > {code} > Connections that implement the _Closeable_ interface or its super-interface, > _AutoCloseable_, needs to be closed after use. That close call must be made > in a *finally* block otherwise an exception could keep the call from being > made. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1590) Resources should be closed in a finally block
[ https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1590. - Resolution: Won't Fix > Resources should be closed in a finally block > - > > Key: CONNECTORS-1590 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1590 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public class DBInterfaceHSQLDB extends Database implements IDBInterface { > ... > try > { > Connection c = > DriverManager.getConnection(_localUrl+databaseName,userName,password); > Statement s = c.createStatement(); > s.execute("SHUTDOWN"); > c.close(); > } > catch (Exception e) > { > // Never any exception! > e.printStackTrace(); > } > {code} > Connections that implement the _Closeable_ interface or its super-interface, > _AutoCloseable_, needs to be closed after use. That close call must be made > in a *finally* block otherwise an exception could keep the call from being > made. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1590) Resources should be closed in a finally block
[ https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782728#comment-16782728 ] Karl Wright commented on CONNECTORS-1590: - This particular invocation is only ever invoked when we're shutting down ManifoldCF. You will find that there are no other such invocations where resources are potentially leaked. In this case, the leak is harmless because we are in the process of shutting down anyway. I don't believe this requires a "fix". > Resources should be closed in a finally block > - > > Key: CONNECTORS-1590 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1590 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public class DBInterfaceHSQLDB extends Database implements IDBInterface { > ... > try > { > Connection c = > DriverManager.getConnection(_localUrl+databaseName,userName,password); > Statement s = c.createStatement(); > s.execute("SHUTDOWN"); > c.close(); > } > catch (Exception e) > { > // Never any exception! > e.printStackTrace(); > } > {code} > Connections that implement the _Closeable_ interface or its super-interface, > _AutoCloseable_, needs to be closed after use. That close call must be made > in a *finally* block otherwise an exception could keep the call from being > made. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
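For reference, the pattern the ticket asks for is straightforward with try-with-resources. The sketch below substitutes a stub for the real java.sql.Connection so it runs without an HSQLDB driver, but the close-on-exception guarantee it demonstrates is the same one the ticket is about:

```java
// Sketch only: StubConnection stands in for java.sql.Connection so the
// example is runnable without a database driver. try-with-resources
// calls close() whether or not the body throws.
public class ShutdownSketch {
    static class StubConnection implements AutoCloseable {
        boolean closed = false;
        void execute(String sql) { /* would run "SHUTDOWN" against HSQLDB */ }
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        StubConnection c = new StubConnection();
        try (StubConnection conn = c) {
            conn.execute("SHUTDOWN");
        } catch (Exception e) {
            e.printStackTrace();
        }
        // close() has already run here, even on an exception path
        System.out.println("closed=" + c.closed);
    }
}
```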
[jira] [Resolved] (CONNECTORS-1589) lrusize always null
[ https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1589. - Resolution: Fixed r1854702 > lrusize always null > --- > > Key: CONNECTORS-1589 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1589 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public abstract class BaseDescription implements ICacheDescription { > ... > public int getMaxLRUCount() >... > String x = null; // > JSKW.getProperty("cache."+objectClassName+".lrusize"); > if (x == null) >... > {code} > Change this condition so that it does not always evaluate to "true" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1589) lrusize always null
[ https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782724#comment-16782724 ] Karl Wright commented on CONNECTORS-1589: - This requires an infrastructure change. The infrastructure change requires the use of a different constructor when property control over max lru count is desired for a class of cached object. I've looked at the places where BaseDescription is extended, and in none of them did I find a compelling case for using properties to control LRU max size. So I've left those alone. > lrusize always null > --- > > Key: CONNECTORS-1589 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1589 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public abstract class BaseDescription implements ICacheDescription { > ... > public int getMaxLRUCount() >... > String x = null; // > JSKW.getProperty("cache."+objectClassName+".lrusize"); > if (x == null) >... > {code} > Change this condition so that it does not always evaluate to "true" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
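The intended logic presumably looks something like the following. Note that `getProperty` here is a hypothetical stand-in for however the new constructor wires in ManifoldCF's real property lookup, so treat this as a sketch of the shape of the fix rather than the actual change:

```java
public class LruSizeSketch {
    // Hypothetical stand-in for ManifoldCF's property lookup (the real
    // code would obtain this via the new constructor described above).
    static String getProperty(String name) {
        return System.getProperty(name);
    }

    static int getMaxLRUCount(String objectClassName, int defaultValue) {
        String x = getProperty("cache." + objectClassName + ".lrusize");
        if (x == null) {
            return defaultValue;        // property unset: fall back to default
        }
        try {
            return Integer.parseInt(x); // property overrides the default
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(getMaxLRUCount("jobs", 1000)); // property unset
        System.setProperty("cache.jobs.lrusize", "50");
        System.out.println(getMaxLRUCount("jobs", 1000)); // property wins
    }
}
```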
[jira] [Assigned] (CONNECTORS-1589) lrusize always null
[ https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1589: --- Assignee: Karl Wright > lrusize always null > --- > > Key: CONNECTORS-1589 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1589 > Project: ManifoldCF > Issue Type: Bug > Components: Framework core >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > {code:java} > public abstract class BaseDescription implements ICacheDescription { > ... > public int getMaxLRUCount() >... > String x = null; // > JSKW.getProperty("cache."+objectClassName+".lrusize"); > if (x == null) >... > {code} > Change this condition so that it does not always evaluate to "true" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1588) Custom Jcifs Properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1588: --- Assignee: Karl Wright > Custom Jcifs Properties > --- > > Key: CONNECTORS-1588 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1588 > Project: ManifoldCF > Issue Type: Improvement > Components: JCIFS connector >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: CONNECTORS-1588 > > > In some cases, "jcifs" is running slowly. In order to solve this problem, we > need to set custom some properties. > > For example; my problem was in my test environment: I have a windows server > and an ubuntu server in same network in AWS EC2 Service. The windows server > has Active Directory service, DNS Server and shared folder while the ubuntu > server has some instance such as manifoldcf, an db instance and solr. > > If the DNS settings are not defined on the ubuntu server, jcifs runs slowly. > Because the default resolver order is set as 'LMHOSTS,DNS,WINS'. It means[1] > ; firstly "jcifs" checks '/etc/hosts' files for linux/unix server'', then it > checks the DNS server. In my opinion, the linux server doesn't recognize the > DNS server and threads are waiting for every file for access to read. > > I suppose, WINS is used when accessing hosts on different subnets. So, I > have set "jcifs.resolveOrder = WINS" and my problem has been FIXED. > > Another suggestion for similar problem from [another > example|https://stackoverflow.com/a/18837754] : "-Djcifs.resolveOrder = DNS" > > We need to set custom resolveOrder variable. > ^[1]^ [https://www.jcifs.org/src/docs/resolver.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1588) Custom Jcifs Properties
[ https://issues.apache.org/jira/browse/CONNECTORS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780473#comment-16780473 ] Karl Wright commented on CONNECTORS-1588: - Patch looks fine. I'll commit it. > Custom Jcifs Properties > --- > > Key: CONNECTORS-1588 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1588 > Project: ManifoldCF > Issue Type: Improvement > Components: JCIFS connector >Affects Versions: ManifoldCF 2.12 >Reporter: Cihad Guzel >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: CONNECTORS-1588 > > > In some cases, "jcifs" is running slowly. In order to solve this problem, we > need to set custom some properties. > > For example; my problem was in my test environment: I have a windows server > and an ubuntu server in same network in AWS EC2 Service. The windows server > has Active Directory service, DNS Server and shared folder while the ubuntu > server has some instance such as manifoldcf, an db instance and solr. > > If the DNS settings are not defined on the ubuntu server, jcifs runs slowly. > Because the default resolver order is set as 'LMHOSTS,DNS,WINS'. It means[1] > ; firstly "jcifs" checks '/etc/hosts' files for linux/unix server'', then it > checks the DNS server. In my opinion, the linux server doesn't recognize the > DNS server and threads are waiting for every file for access to read. > > I suppose, WINS is used when accessing hosts on different subnets. So, I > have set "jcifs.resolveOrder = WINS" and my problem has been FIXED. > > Another suggestion for similar problem from [another > example|https://stackoverflow.com/a/18837754] : "-Djcifs.resolveOrder = DNS" > > We need to set custom resolveOrder variable. > ^[1]^ [https://www.jcifs.org/src/docs/resolver.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
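If it helps anyone else hitting the slow-resolution symptom described above: the resolver order can also be overridden with a system property set before any jCIFS class loads (per the jCIFS resolver docs linked in the issue, jCIFS reads its properties at class initialization). This is equivalent to passing -Djcifs.resolveOrder=... on the java command line; the sketch below only demonstrates setting the property:

```java
// Sketch: override the jCIFS name resolution order. This must happen
// before any jcifs class is loaded, since jCIFS reads its properties
// at class initialization. Equivalent to -Djcifs.resolveOrder=DNS,WINS.
public class ResolveOrderSketch {
    public static void main(String[] args) {
        System.setProperty("jcifs.resolveOrder", "DNS,WINS");
        System.out.println(System.getProperty("jcifs.resolveOrder"));
    }
}
```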
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780373#comment-16780373 ] Karl Wright commented on CONNECTORS-1563: - Hi [~Subasini], The "excluded mime types" that you set are meant to exclude documents *entirely*, so changing that setting has no effect on *how* documents are indexed. You can look at the Simple History report to verify that this is taking place as you desire, because most connectors create a record when they reject a document for any reason. The Web Connector is no exception. > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved LUCENE-8696. - Resolution: Fixed Fix Version/s: 7.7.2 master (9.0) 8.x > TestGeo3DPoint.testGeo3DRelations failure > - > > Key: LUCENE-8696 > URL: https://issues.apache.org/jira/browse/LUCENE-8696 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Karl Wright >Priority: Major > Fix For: 8.x, master (9.0), 7.7.2 > > Attachments: LUCENE-8696.patch > > > Reproduce with: > {code:java} > ant test -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations > -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1{code} > Error: > {code:java} > [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<< > [junit4] > Throwable #1: java.lang.AssertionError: invalid hits for > shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, > width=1.3439035240356338(77.01), > points={[[lat=2.4457272005608357E-47, > lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, > Z=2.448463612203698E-47])], [lat=-0.7718789008737459, > lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, > Z=-0.6971214014446648])]]}}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778989#comment-16778989 ] Karl Wright commented on CONNECTORS-1587: - It is simple; the crawler is requesting more metadata columns at one time than your Sharepoint instance is allowed to respond to. This is a SharePoint configuration issue, apparently, although it is one I've never heard of before. It's certainly *not* a SharePoint Connector issue, unless there's some hard-wired Microsoft limit that you are up against. > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1587. - Resolution: Invalid Not a ManifoldCF bug > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778159#comment-16778159 ] Karl Wright commented on CONNECTORS-1564: - [~erlendfg], if ModifiedHttpSolrClient overrides this setting already, then I don't understand why it isn't working, unless the override isn't setting it. Is that the case? If so, then the obvious fix is to just set it there. > Support preemptive authentication to Solr connector > --- > > Key: CONNECTORS-1564 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1564 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Reporter: Erlend Garåsen >Assignee: Karl Wright >Priority: Major > Attachments: CONNECTORS-1564.patch > > > We should post preemptively in case the Solr server requires basic > authentication. This will make the communication between ManifoldCF and Solr > much more effective instead of the following: > * Send a HTTP POST request to Solr > * Solr sends a 401 response > * Send the same request, but with a "{{Authorization: Basic}}" header > With preemptive authentication, we can send the header in the first request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
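For context, the standard preemptive-auth recipe with Apache HttpClient 4.x looks roughly like the following. This is a sketch, not the actual connector change; the host, port, and credentials are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;

// Sketch: pre-populate the auth cache so the Authorization header goes
// out on the first request instead of after a 401 round trip.
HttpHost solr = new HttpHost("localhost", 8983, "http");

BasicCredentialsProvider creds = new BasicCredentialsProvider();
creds.setCredentials(new AuthScope(solr),
    new UsernamePasswordCredentials("user", "password"));

BasicAuthCache authCache = new BasicAuthCache();
authCache.put(solr, new BasicScheme());   // mark host for preemptive Basic

HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(creds);
context.setAuthCache(authCache);
// ...then pass `context` on every httpClient.execute(solr, request, context)
```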
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778005#comment-16778005 ] Karl Wright commented on LUCENE-8696: - I have confirmed that the above is indeed the issue. I did this by checking whether intersection with the segment end planes was taking place, and it was. There are two ways forward. The first way is to make this hack officially part of the code base. That will probably be fine for real-world paths, because real-world paths are much narrower than what occurs in random testing. The second way would be to change how we represent segment endpoints, so that there is no gap between one of the points and the adjoining path segment. The way to do that is to use TWO planes rather than one, but only when there are two adjoining segments and a gap is thus present. Membership would be tricky because, depending on the specific conformation of the segment endpoint, EITHER plane or BOTH planes would need to match the point being tested. But we could determine this by simply looking at the fourth point in the context of a plane constructed from the other three. Such a change would finally make GeoPaths first-class citizens in the oblate world, at the cost of needing to have a second plane for each segment endpoint. But there's no reason we can't use class inheritance to solve that problem too. So a base SegmentEndpoint class or interface would have multiple implementations, and the right one could be picked at path construction time, to match the conformation. For SPHERE planets, the simplest implementation would still be the one that got used. 
> TestGeo3DPoint.testGeo3DRelations failure > - > > Key: LUCENE-8696 > URL: https://issues.apache.org/jira/browse/LUCENE-8696 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Karl Wright >Priority: Major > Attachments: LUCENE-8696.patch > > > Reproduce with: > {code:java} > ant test -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations > -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1{code} > Error: > {code:java} > [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<< > [junit4] > Throwable #1: java.lang.AssertionError: invalid hits for > shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, > width=1.3439035240356338(77.01), > points={[[lat=2.4457272005608357E-47, > lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, > Z=2.448463612203698E-47])], [lat=-0.7718789008737459, > lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, > Z=-0.6971214014446648])]]}}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777688#comment-16777688 ] Karl Wright commented on LUCENE-8696: - Since we've eliminated the computation of the solid's example intersection points, that basically leaves numerical factors as the only potential cause. Let's examine this further. In the case of GeoPaths on the WGS84 globe, path intersection points are described by "circles", which are in fact just planes that are picked so as to connect the path segments together, as described above. But, each plane with two adjoining segments is selected based on FOUR surface points, not three. That means that there is a gap between one of the points and the actual endpoint circle. When we compute membership in the path, we exclude points in that gap from membership. This is done by considering the path segment end planes as delimiters of membership for both the endpoint "circles" as well as the segments. But, those segment end planes are not considered when determining intersection, because they are "interior" to the path. This means that it is possible for getRelationship() to miss an intersection with the path edge if the "gap" is large enough and everything lines up perfectly, and thus "CONTAINS" is reported where "OVERLAPS" would be the actual correct answer. It should be possible to see if our test case would be resolved by considering path segment end edges. A simple trial code change should be sufficient to know. Then the question becomes how to prevent spurious intersections? We could just permit them (it's allowed in the contract), or we could make more significant changes to path representation, for better accuracy. Stay tuned. 
> TestGeo3DPoint.testGeo3DRelations failure > - > > Key: LUCENE-8696 > URL: https://issues.apache.org/jira/browse/LUCENE-8696 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Karl Wright >Priority: Major > Attachments: LUCENE-8696.patch > > > Reproduce with: > {code:java} > ant test -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations > -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1{code} > Error: > {code:java} > [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<< > [junit4] > Throwable #1: java.lang.AssertionError: invalid hits for > shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, > width=1.3439035240356338(77.01), > points={[[lat=2.4457272005608357E-47, > lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, > Z=2.448463612203698E-47])], [lat=-0.7718789008737459, > lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, > Z=-0.6971214014446648])]]}}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899 ] Karl Wright edited comment on LUCENE-8696 at 2/26/19 7:24 AM: -- Reviewing the solid, and what the edge points *should* be: minx, maxx: -0.7731590077686981, 1.0011188539924791 miny, maxy: 0.9519964046486451, 1.0011188539924791 minz, maxz: -0.9977622932859775, 0.9977599768255027 The minz/maxz planes might touch the world at the poles, but probably don't. The maxx plane might touch the world at the max X pole. The minx plane definitely slices the world, so it should generate at least one point. The maxy plane might touch the world at the max Y pole. The miny plane slices the world, so it should generate at least one point. This is the debugging output: {code} [junit4] 2> notableMinXPoints=[] notableMaxXPoints=[] notableMinYPoints=[] notableMaxYPoints=[] notableMinZPoints=[] notableMaxZPoints=[] [junit4] 2> minXEdges=[] maxXEdges=[] minYEdges=[[X=0.0, Y=0.9519964046486451, Z=-0.30870622678085735]] maxYEdges=[[X=-0.0, Y=1.0011188539924791, Z=0.0]] minZEdges=[] maxZEdges=[] {code} "Notable points" are places where the plane intersections also intersect the world. There are none of these, as expected. The planes that intersect the world are minY and maxY. We do *not* see intersections for minX, though, and we expected to. That needs to be researched to figure out why. It may be because the intersection is actually outside the solid bounds as determined by the Y plane. So the question becomes whether the line (-0.7731590077686981, 0.9519964046486451, t) can ever go through the world. We can surely determine that by picking t = 0 and computing the distance to the origin: sqrt(x^2 + y^2 + 0) = 1.2264061340998847885343642874005, which is indeed off the surface. So the points look reasonable. 
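The arithmetic in that last step can be checked directly. The candidate line holds x = minX and y = minY fixed while t varies along Z, so its distance from the Z axis is constant and evaluating at t = 0 suffices: if sqrt(x^2 + y^2) exceeds 1, the line never reaches the (approximately unit-radius) world surface.

{code:java}
// Verify that the corner line (minX, minY, t) misses the world: its constant
// distance from the Z axis already exceeds the ~unit world radius.
public class EdgeLineCheck {
    static double distance(double x, double y) {
        return Math.sqrt(x * x + y * y);
    }

    public static void main(String[] args) {
        double minX = -0.7731590077686981;
        double minY = 0.9519964046486451;
        System.out.println(distance(minX, minY)); // ~1.2264, off the surface
    }
}
{code}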
[jira] [Commented] (SOLR-13270) SolrJ does not send "Expect: 100-continue" header
[ https://issues.apache.org/jira/browse/SOLR-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777186#comment-16777186 ] Karl Wright commented on SOLR-13270: I just grepped for it and did not find it explicitly set: {code} kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ grep -R "setExpectContinue" . --include "*.java" kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ {code} I therefore believe it's being set because the RequestConfig is being overwritten. And, sure enough: {code} kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ grep -R "setDefaultRequestConfig" . --include "*.java" ./lucene/replicator/src/java/org/apache/lucene/replicator/http/HttpClientBase.java: httpc = HttpClientBuilder.create().setConnectionManager(conMgr).setDefaultRequestConfig(this.defaultConfig).build(); ./solr/solrj/src/java/org/apache/solr/client/solrj/impl/HttpClientUtil.java: HttpClientBuilder retBuilder = builder.setDefaultRequestConfig(requestConfig); kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ {code} > SolrJ does not send "Expect: 100-continue" header > - > > Key: SOLR-13270 > URL: https://issues.apache.org/jira/browse/SOLR-13270 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrJ >Affects Versions: 7.7 >Reporter: Erlend Garåsen >Priority: Major > > SolrJ does not set the "Expect: 100-continue" header, even though it's > configured in HttpClient: > {code:java} > builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build());{code} > A HttpClient developer has reviewed the code and says we're setting up > the client correctly, so we have a reason to believe there is a bug in > SolrJ. 
It's actually a problem we are facing in ManifoldCF, explained in: > https://issues.apache.org/jira/browse/CONNECTORS-1564 > The problem can be reproduced by building and running the following small > Maven project: > [http://folk.uio.no/erlendfg/solr/missing-header.zip] > The application runs SolrJ code where the header does not show up and > HttpClient code where the header is present. > > {code:java} > HttpClientBuilder builder = HttpClients.custom(); > // This should add an Expect: 100-continue header: > builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build()); > HttpClient httpClient = builder.build(); > // Start Solr and create a core named "test". > String baseUrl = "http://localhost:8983/solr/test"; > // Test using SolrJ — no expect 100 header > HttpSolrClient client = new HttpSolrClient.Builder() > .withHttpClient(httpClient) > .withBaseSolrUrl(baseUrl).build(); > SolrQuery query = new SolrQuery(); > query.setQuery("*:*"); > client.query(query); > // Test using HttpClient directly — expect 100 header shows up: > HttpPost httpPost = new HttpPost(baseUrl); > HttpEntity entity = new InputStreamEntity(new > ByteArrayInputStream("test".getBytes())); > httpPost.setEntity(entity); > httpClient.execute(httpPost); > {code} > When using the last HttpClient test, the expect 100 header appears in > missing-header.log: > {noformat} > http-outgoing-1 >> Expect: 100-continue{noformat}
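A plausible mechanism for the lost header, consistent with the grep results above, is that a later setDefaultRequestConfig call installs a new RequestConfig wholesale rather than merging with the one the caller supplied. This is a toy model of that last-write-wins behavior, not the real HttpClient API:

{code:java}
// Toy model: a builder whose setDefaultRequestConfig REPLACES any previously
// supplied configuration, so a caller-set flag like expectContinueEnabled is
// silently lost when the library later installs its own default.
public class ConfigOverwrite {
    static final class Config {
        final boolean expectContinueEnabled;
        Config(boolean e) { this.expectContinueEnabled = e; }
    }

    static final class Builder {
        private Config config = new Config(false);
        Builder setDefaultRequestConfig(Config c) {
            this.config = c; // last write wins: the earlier config is discarded
            return this;
        }
        Config build() { return config; }
    }

    public static void main(String[] args) {
        Builder b = new Builder();
        b.setDefaultRequestConfig(new Config(true));  // user enables Expect: 100-continue
        b.setDefaultRequestConfig(new Config(false)); // library installs its own default
        System.out.println(b.build().expectContinueEnabled); // false
    }
}
{code}

If this is what HttpClientUtil does with the user-built client, the fix would be to merge the caller's RequestConfig instead of replacing it.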
[jira] [Commented] (SOLR-13270) SolrJ does not send "Expect: 100-continue" header
[ https://issues.apache.org/jira/browse/SOLR-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777113#comment-16777113 ] Karl Wright commented on SOLR-13270: Hi [~erlendfg], can you identify where in the SolrJ code it explicitly sets expect/continue to "off"? It must be there somewhere. 
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776975#comment-16776975 ] Karl Wright commented on LUCENE-8696: - [~jpountz], should be addressed now. 
[jira] [Comment Edited] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899 ] Karl Wright edited comment on LUCENE-8696 at 2/25/19 2:39 PM
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899 ] Karl Wright commented on LUCENE-8696: - Reviewing the solid, and what the edge points *should* be: minx, maxx: -0.7731590077686981, 1.0011188539924791 miny, maxy: 0.9519964046486451, 1.0011188539924791 minz, maxz: -0.9977622932859775, 0.9977599768255027 The minz/maxz planes might touch the world at the poles, but probably don't. The maxx plane might touch the world at the max X pole. The minx plane definitely slices the world, so it should generate at least one point. The maxy plane might touch the world at the max Y pole. The miny plane slices the world, so it should generate at least one point. We therefore should expect a minimum of two points, which is what we see. If any of these planes actually encounters the pole, though, we should have gotten another point from that. The maxZ plane looks potentially like it might qualify. Out of time for the moment though. 
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776881#comment-16776881 ] Karl Wright commented on LUCENE-8696: - Reviewing the solid edge point logic finds nothing wrong. Will try to rule out numerical precision problems next. 
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776841#comment-16776841 ] Karl Wright commented on LUCENE-8696: - I've verified that there are two solid edge points and they both lie within the path: {code} [junit4] 2> solid edge point [X=0.0, Y=0.9519964046486451, Z=-0.30870622678085735] path.isWithin()? true [junit4] 2> solid edge point [X=-0.0, Y=1.0011188539924791, Z=0.0] path.isWithin()? true [junit4] 2> path edge point [X=0.22516844226485835, Y=0.003930329545205224, Z=0.9721897091178435] isWithin()? false minx=0.9983274500335564 maxx=-0.7759504117276208 miny=-0.9480660751034399 maxy=-0.9971885244472739 minz=1.969952002403821 maxz=-0.025570267707659133 {code} So this confirms that no intersection is detected, and shows how the conclusion that the solid is completely within the path is arrived at. Possible errors that would cause this: (1) We might be missing a solid edge point. These edge points are computed based on the lines of intersection between adjoining solid planes and the surface of the world. There is also special computation to handle the case where a solid edge plane intersects the world by itself, but this logic might not be complete. We need to capture all plane/world intersection closed curves and come up with an example point for each. (2) There might be numerical precision issues with intersection computation that prevent us from concluding that the path edges intersect the solid edges. I still have to figure out which is the real problem here. 
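For cause (1), a first sanity check is whether a bounding plane intersects the world at all. On a unit-sphere approximation of the world (WGS84 oblateness shifts the threshold only slightly, since its radii are about 1.0011 and 0.9977), the plane A*x + B*y + C*z + D = 0 meets the sphere iff its distance from the origin, |D| / ||(A, B, C)||, is at most 1:

{code:java}
// Sketch: does a plane intersect the (unit-sphere approximation of the) world?
public class PlaneWorld {
    static boolean intersectsUnitSphere(double a, double b, double c, double d) {
        double norm = Math.sqrt(a * a + b * b + c * c);
        // Distance from the origin to the plane, compared against the radius.
        return Math.abs(d) / norm <= 1.0;
    }

    public static void main(String[] args) {
        // The minX plane x = -0.7731590077686981 slices the world...
        System.out.println(intersectsUnitSphere(1, 0, 0, 0.7731590077686981)); // true
        // ...while a plane at distance 1.5 misses it entirely.
        System.out.println(intersectsUnitSphere(1, 0, 0, 1.5)); // false
    }
}
{code}

Any plane that passes this test but contributed no edge point would point at incomplete edge-point enumeration rather than precision trouble.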
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776821#comment-16776821 ] Karl Wright commented on LUCENE-8696: - Looking at the actual failure now. Basically, the problem is that the relationship between the XYZSolid and the GeoPath is containment: the XYZSolid is reported to be inside the GeoPath. It reaches this conclusion because it detects no intersections between the solid and the path edges, and because the path edge point it is using is outside the solid: {code} [junit4] 1> in isShapeInsideArea [junit4] 1> there are 1 pathPoints [junit4] 1> pathpoint [X=0.22516844226485835, Y=0.003930329545205224, Z=0.9721897091178435]... [junit4] 1> outside {code} I haven't verified it yet, but this implies that at least one of the solid's surface points is inside the path too. It's still too early to know which conclusion is incorrect.
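That containment decision can be summarized as a small decision table: with no detected edge intersections, the relationship is settled by spot-checking one representative point from each shape. This is a simplified sketch with hypothetical names; the real Geo3D logic has more cases:

{code:java}
// Simplified decision table for relating two areas once edge-intersection
// detection has run (hypothetical helper, not the actual Geo3D code).
public class RelationSketch {
    static final int CONTAINS = 0, WITHIN = 1, DISJOINT = 2, OVERLAPS = 3;

    static int relate(boolean edgesIntersect,
                      boolean pathPointInsideSolid,
                      boolean solidPointInsidePath) {
        if (edgesIntersect) return OVERLAPS;
        // No intersections: representative points decide containment.
        if (solidPointInsidePath && !pathPointInsideSolid) return CONTAINS; // solid inside path
        if (pathPointInsideSolid && !solidPointInsidePath) return WITHIN;   // path inside solid
        if (!pathPointInsideSolid && !solidPointInsidePath) return DISJOINT;
        return OVERLAPS;
    }

    public static void main(String[] args) {
        // The failing case: no intersections found, path edge point outside the
        // solid, solid edge points inside the path. Containment is reported;
        // if an intersection was actually missed, that answer is wrong.
        System.out.println(relate(false, false, true)); // 0 (CONTAINS)
    }
}
{code}

The sketch makes the failure mode concrete: a single missed intersection flips the answer from OVERLAPS to a containment verdict.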
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776637#comment-16776637 ] Karl Wright commented on LUCENE-8696: - I revised the simple test case to match the actual failure, and committed it with @AwaitsFix. I'm now committing to master, branch_7x, and branch_8x. No further fixes for branch_6x. 
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776617#comment-16776617 ] Karl Wright commented on LUCENE-8696: - [~ivera], I'm looking at your test case for reproducing the original failure and I honestly can't find any place in testGeo3DRelations where we expect two paths with different widths to exactly fit inside one another. The only relationships that are computed in this test are between an xyz solid and a path. Can you describe how you came up with the simplified test case?
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776010#comment-16776010 ] Karl Wright commented on LUCENE-8696: - More debugging shows that the second circle plane is wildly different in the two runs: {code} [junit4] 1> Checking 'iswithin' for 0.020717830200521595 0.9523290534985549 0.30699177254488114 [junit4] 1> pathPoint... [junit4] 1> outside of circle [A=0.9998476951745469, B=0.01745240539714465, C=-0.0, D=-0.5409068252602056, side=1.0] [junit4] 1> pathPoint... [junit4] 1> passes circle plane [A=0.7071067811865476, B=-0.7071067811865476, C=0.0, D=0.05929892163149414, side=-1.0] [junit4] 1> within! [junit4] 1> Checking 'iswithin' for 0.020717830200521595 0.9523290534985549 0.30699177254488114 [junit4] 1> pathPoint... [junit4] 1> outside of circle [A=0.9998476951745469, B=0.017452405397144648, C=-0.0, D=-0.22520274172912894, side=1.0] [junit4] 1> pathPoint... [junit4] 1> outside of circle [A=0.7863183388224225, B=-0.6178215519319035, C=0.0, D=-0.0021572780909792644, side=1.0] [junit4] 1> pathPoint... [junit4] 1> outside of cutoff plane [A=0.6045468388328157, B=-0.796569594986684, C=-3.0241383426688587E-48, D=0.0, side=1.0] [junit4] 1> pathPoint... [junit4] 1> outside of cutoff plane [A=-0.6885949363624547, B=-0.29030954074708304, C=-0.6644978436136604, D=0.0, side=1.0] [junit4] 1> segment... [junit4] 1> segment... [junit4] 1> segment... {code} For the successful run it's: [A=0.7071067811865476, B=-0.7071067811865476, C=0.0, D=0.05929892163149414, side=-1.0] For the failed run it's: [A=0.7863183388224225, B=-0.6178215519319035, C=0.0, D=-0.0021572780909792644, side=1.0] The naive expectation would be that the vector is identical (A,B,C), but the displacement differs (D). But because this is WGS84, that expectation is incorrect, because oblateness can affect the vector. Because of oblateness, the circle is constructed from three of the four points where the segment edges intersect. 
Which three it picks is random, but the hope is that the selection is not important. What this shows is that very wide paths on oblate spheroids are mathematically unrelatable to each other. This is not exactly surprising in retrospect; paths were originally designed for a SPHERE world and retrofitting them to WGS84 involved compromises. I therefore think the best approach might be to modify the test suite to limit the width of paths tested on WGS84. [~ivera], what do you think? > TestGeo3DPoint.testGeo3DRelations failure > - > > Key: LUCENE-8696 > URL: https://issues.apache.org/jira/browse/LUCENE-8696 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Karl Wright >Priority: Major > Attachments: LUCENE-8696.patch > > > Reproduce with: > {code:java} > ant test -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations > -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1{code} > Error: > {code:java} > [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<< > [junit4] > Throwable #1: java.lang.AssertionError: invalid hits for > shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, > width=1.3439035240356338(77.01), > points={[[lat=2.4457272005608357E-47, > lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, > Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, > lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, > Z=2.448463612203698E-47])], [lat=-0.7718789008737459, > lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, > Z=-0.6971214014446648])]]}}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional 
commands, e-mail: dev-h...@lucene.apache.org
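The acceptance difference between the two runs can be reproduced with simple arithmetic. Below is a minimal sketch (not the actual Lucene spatial3d classes) that models a sided plane's membership test as side · (Ax + By + Cz + D) ≥ 0, ignoring the numeric tolerance the real implementation applies; the coefficients and the test point are copied from the debug output above:

```java
// Minimal model of a "sided plane": a plane Ax + By + Cz + D = 0 plus a
// sidedness value (+1 or -1) recording which side counts as "within".
// This is a sketch for illustration, not the spatial3d SidedPlane class.
public class SidedPlaneSketch {
    final double a, b, c, d, side;

    SidedPlaneSketch(double a, double b, double c, double d, double side) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.side = side;
    }

    double evaluate(double x, double y, double z) {
        return a * x + b * y + c * z + d;
    }

    boolean isWithin(double x, double y, double z) {
        // The point is "within" when it lies on the same side of the plane
        // as the plane's constructing point.
        return side * evaluate(x, y, z) >= 0.0;
    }

    public static void main(String[] args) {
        double x = 0.020717830200521595, y = 0.9523290534985549, z = 0.30699177254488114;
        // Second circle plane from the successful run:
        SidedPlaneSketch ok = new SidedPlaneSketch(
            0.7071067811865476, -0.7071067811865476, 0.0, 0.05929892163149414, -1.0);
        // Second circle plane from the failed run:
        SidedPlaneSketch bad = new SidedPlaneSketch(
            0.7863183388224225, -0.6178215519319035, 0.0, -0.0021572780909792644, 1.0);
        System.out.println(ok.isWithin(x, y, z));   // true: point accepted
        System.out.println(bad.isWithin(x, y, z));  // false: point rejected
    }
}
```

Both planes evaluate to a negative value at the point; only the sidedness differs, which is why one run accepts the point and the other rejects it.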
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776006#comment-16776006 ] Karl Wright commented on LUCENE-8696: - Added some simple diagnostics. The difference lies in the construction of the second circle plane: {code} [junit4] 1> Checking 'iswithin' for 0.020717830200521595 0.9523290534985549 0.30699177254488114 [junit4] 1> pathPoint... [junit4] 1> outside of circle [junit4] 1> pathPoint... [junit4] 1> within! [junit4] 1> Checking 'iswithin' for 0.020717830200521595 0.9523290534985549 0.30699177254488114 [junit4] 1> pathPoint... [junit4] 1> outside of circle [junit4] 1> pathPoint... [junit4] 1> outside of circle [junit4] 1> pathPoint... [junit4] 1> outside of cutoff plane [A=0.6045468388328157, B=-0.796569594986684, C=-3.0241383426688587E-48, D=0.0, side=1.0] [junit4] 1> pathPoint... [junit4] 1> outside of cutoff plane [A=-0.6885949363624547, B=-0.29030954074708304, C=-0.6644978436136604, D=0.0, side=1.0] [junit4] 1> segment... [junit4] 1> segment... [junit4] 1> segment... {code} So the second circle plane accepts the point in the narrower case, but rejects it in the wider case. Digging further. 
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776002#comment-16776002 ] Karl Wright commented on LUCENE-8696: - Hmm, even when I use createSurfacePoint() with this point, it still fails. So I need to look deeper.
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775839#comment-16775839 ] Karl Wright commented on LUCENE-8696: - Preliminary results indicate that the problem may be due to the fact that the point isn't on the surface. The following test fails: {code} GeoPoint check = new GeoPoint(0.02071783020158524, 0.9523290535474472, 0.30699177256064203); assertTrue(PlanetModel.WGS84.pointOnSurface(check)); {code} Because path geometry uses surface circles and parallel slicing planes, paths can be particularly susceptible to misconstruing membership for points that are off the world. I'll try to confirm this picture.
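A quick way to see what a pointOnSurface check amounts to is to evaluate the ellipsoid equation directly. The sketch below is a model under stated assumptions, not Lucene code: the WGS84 axis scalings and the tolerance are approximate values I am assuming here (spatial3d normalizes the ellipsoid so the mean radius is near 1.0), so treat the constants as illustrative:

```java
// Sketch of an ellipsoid surface-membership test, modeling what a
// PlanetModel.pointOnSurface-style check does. AB (equatorial) and C
// (polar) are assumed approximations of the normalized WGS84 axes.
public class SurfaceCheck {
    static final double AB = 1.0011188539924791; // equatorial axis (assumed)
    static final double C  = 0.9977622920221051; // polar axis (assumed)

    // Evaluates x^2/AB^2 + y^2/AB^2 + z^2/C^2 - 1; zero means on-surface.
    static double evaluate(double x, double y, double z) {
        return (x * x + y * y) / (AB * AB) + (z * z) / (C * C) - 1.0;
    }

    static boolean pointOnSurface(double x, double y, double z) {
        return Math.abs(evaluate(x, y, z)) < 1e-12; // illustrative tight tolerance
    }

    public static void main(String[] args) {
        System.out.println(pointOnSurface(AB, 0.0, 0.0)); // true: point on the equator
        // Residual for the point from the failing test; a residual beyond the
        // tolerance would explain why membership logic misbehaves for it.
        System.out.println(evaluate(0.02071783020158524, 0.9523290535474472, 0.30699177256064203));
    }
}
```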
[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data
[ https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775088#comment-16775088 ] Karl Wright commented on CONNECTORS-1587: - Can you amend your ticket to tell us what connectors you are using for your job? This ticket is very nearly incomprehensible, and unless it is amended I will close it on that basis. > Unable to Crawl Documents Meta data > --- > > Key: CONNECTORS-1587 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1587 > Project: ManifoldCF > Issue Type: Bug >Reporter: Pavithra Dhakshinamurthy >Priority: Major > > I tried to crawl the meta data of Document section. but cannot able to crawl > the data. > I have facing error stating that " The query cannot be completed because the > number of lookup columns it contains exceeds the lookup column threshold > enforced by the administrator." > How can I resolve this issue.Is there any config needs for that. > Please assist the same. > While checking for documentation mentioned the meta data contents as > drop-down type but my connector(Manifold 2.9.1) is different. Is there any > version update is there for this lookup. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775050#comment-16775050 ] Karl Wright commented on LUCENE-8696: - The path in the test retraces its steps, but that should not be a problem for membership testing. I'll look into it starting this evening.
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774309#comment-16774309 ] Karl Wright commented on CONNECTORS-1584: - Have you subscribed to the list? Instructions are in the documentation for "contact us". You send mail to: user-subscr...@manifoldcf.apache.org > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What type of regexs does manifold include and exclude support and also in > general regex support? > At the moment i'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude urls that link to documents. > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF > The issue i'm having is that the regex that I have found so far doesn't work > case insensitive, so for every possible case i have to add a new line. > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation what type of regex is able to be used or > maybe a tool to test your regex and see if it is supported by manifold ? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > adress returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
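For reference, ManifoldCF is a Java application, so its include/exclude rules are presumably evaluated with the java.util.regex engine (an assumption worth verifying against your connector version). If so, a single pattern with the inline (?i) flag covers every case variant, instead of one line per casing:

```java
import java.util.regex.Pattern;

// Sketch: case-insensitive URL exclusion with one pattern, assuming the
// web connector's rules use java.util.regex semantics. Note the escaped
// dot: ".pdf" without the backslash would match any character before "pdf".
public class RegexCase {
    static final Pattern PDF = Pattern.compile("(?i).*\\.pdf$");

    static boolean excluded(String url) {
        return PDF.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(excluded("website.com/document/path/this.pdf"));  // true
        System.out.println(excluded("website.com/document/path/other.PDF")); // true
        System.out.println(excluded("website.com/page.html"));               // false
    }
}
```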
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774105#comment-16774105 ] Karl Wright commented on CONNECTORS-1584: - Actually, it *is* user@, but so many people get that mixed up that I got it backwards myself. What failure notice did you get when you mailed user@? I receive email from this list a dozen times a day or more, so I am not sure why you'd be having trouble.
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772995#comment-16772995 ] Karl Wright commented on CONNECTORS-1563: - [~Subasini], we are trying to debug your setup. The first principle of debugging is to identify where exactly the problem is occurring. It eliminates one variable. The file system connector is quite simple and has few configuration options, so it should be easy to set something up we can use to evaluate your solr connection. Thanks, Karl > SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream > must have > 0 bytes > --- > > Key: CONNECTORS-1563 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1563 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Reporter: Sneha >Assignee: Karl Wright >Priority: Major > Attachments: Document simple history.docx, Manifold and Solr > settings_CustomField.docx, managed-schema, manifold settings.docx, > manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml > > > I am encountering this problem: > I have checked "Use the Extract Update Handler:" param then I am getting an > error on Solr i.e. null:org.apache.solr.common.SolrException: > org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 > bytes > If I ignore tika exception, my documents get indexed but dont have content > field on Solr. > I am using Solr 7.3.1 and manifoldCF 2.8.1 > I am using solr cell and hence not configured external tika extractor in > manifoldCF pipeline > Please help me with this problem > Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772912#comment-16772912 ] Karl Wright commented on CONNECTORS-1563: - [~Subasini], the "error" is because it does not recognize a specific translation bundle for your language, so it defaults to English. It is harmless. I asked you to *try* working with a File System connection initially to narrow down where your problems were coming from. Please do so. [~shinichiro abe] and I both tried a configuration similar to the one you report at the end of last year, when we were debugging the 2.11 release of ManifoldCF.
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772729#comment-16772729 ] Karl Wright commented on CONNECTORS-1563: - In general in cases like this I recommend that people start with the simplest possible working configuration and then modify it until they achieve their goals. In this case that would mean starting with a file system job and a freshly-installed Solr instance, with no other changes whatsoever. [~shinichiro abe], can you help Mr. Rath by trying MCF 2.12 with a fresh single-process Solr instance, using the "/update" handler? He claims that this does not work and I do not have any time to work with him for the next few weeks. If it works for you please provide detailed steps describing what you did. Thanks in advance!
[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure
[ https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771682#comment-16771682 ] Karl Wright commented on LUCENE-8696: - [~ivera], would you be willing to construct a simple test case? I can't possibly look at this until the weekend, but it would help.
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771663#comment-16771663 ] Karl Wright commented on CONNECTORS-1563: - Hi Subasini, Are you now Tika-extracting in ManifoldCF, or in Solr? The text field looks like it contains properly extracted content, along with other stuff you do not want. Is this correct? If the extraction is happening in Solr, then I have no idea where this is coming from. If the extraction is happening in ManifoldCF, and you have placed a Metadata Adjuster transformer in the pipeline between the Tika Extractor and the Solr Output Connector, I'd say you had set it up to concatenate many fields together into a text field. The Metadata Adjuster has that ability. The choice of how metadata (or content) fields get mapped to the Solr schema is set up in your Solr output connection configuration. The Tika extraction basically replaces a binary input document with a character-sequence output document plus metadata fields. The character-sequence output document must then be sent to Solr not via the extracting update handler but via the standard handler, so the handler should be changed from /update/extract to just /update, and "Use extracting update handler" should be turned off. The actual field name used for the extracted content body can also be changed, if desired, in the "Schema" part of the configuration. But what is there by default works with Solr as it's set up by default.
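The handler change described above amounts to posting to a different update URL. A hypothetical helper (the base URL and core name are made-up examples, not values from this ticket) illustrates the two targets:

```java
// Hedged sketch: which Solr endpoint the output connector should target.
// The extracting handler parses binaries server-side via Solr Cell; the
// plain /update handler expects text that was already extracted, e.g. by
// a Tika Extractor transformer in the ManifoldCF pipeline.
public class SolrHandlerChoice {
    static String updateUrl(String baseUrl, String core, boolean useExtractingHandler) {
        String handler = useExtractingHandler ? "/update/extract" : "/update";
        return baseUrl + "/" + core + handler;
    }

    public static void main(String[] args) {
        // With MCF-side Tika extraction, switch off the extracting handler:
        System.out.println(updateUrl("http://localhost:8983/solr", "collection1", false));
        // prints http://localhost:8983/solr/collection1/update
    }
}
```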
[jira] [Resolved] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1584. - Resolution: Not A Problem