[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826879#comment-16826879
 ] 

Karl Wright commented on CONNECTORS-1602:
-

[~DonaldVdD] ManifoldCF keeps a queue of documents which it recrawls.  The 
crawling is only completed when all the documents are no longer in a state 
where they need to be fetched.  For a continuous job, all documents once 
fetched are immediately requeued, so this never happens.

As for session-based login, if you set up your login sequence properly, then 
when a document is fetched that needs a fresh cookie, the login will take place 
at that point and a new cookie will be used.


> Continuous crawling doesn't recrawl everything
> --
>
> Key: CONNECTORS-1602
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Donald Van den Driessche
>Priority: Major
>
> When crawling a website in continuous crawling mode, we saw that not all 
> documents are recrawled.
> The site is quite extensive. We figured out that, after crawling, a 
> document/page gets a recrawl timestamp somewhere between the recrawl interval 
> and the max recrawl interval.
> But if these timestamps fall within the first crawl, ManifoldCF starts 
> recrawling those documents and seems to ignore the rest of the website. Also, 
> sometimes documents get recrawled 5 times while others don't get recrawled at 
> all, apparently due to the same issue.
>  
> Is it possible to shed a bit more light on continuous crawling?
> Is it a good system to use for crawling an extensive website?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826725#comment-16826725
 ] 

Karl Wright commented on CONNECTORS-1602:
-

Hi [~DonaldVdD], MCF keeps crude statistics on how often the doc changes.  As I 
said, it gets recrawled *eventually*, and if it does not change, the time until 
the next crawl is doubled, up to the maximum the job is configured for.

As for when the job "stops", the continuous crawl jobs do not stop.  They run 
indefinitely until manually aborted.




[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826349#comment-16826349
 ] 

Karl Wright commented on CONNECTORS-1602:
-

Continuous crawling bases the next crawl time on the last time the document 
changed.  In general it doubles the crawling interval, up to the maximum, 
before retrying.  So if your document doesn't change very often, the crawler 
may wait quite some time before reviewing it.
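
The doubling rule described above can be sketched as follows. This is only an illustration of the idea, not ManifoldCF's scheduling code: the class and method names are hypothetical, and the choice to drop back to the minimum interval when a document has changed is an assumption that the comment above does not spell out.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the interval-doubling idea; not ManifoldCF code. */
final class RecrawlBackoff {
  private RecrawlBackoff() {}

  /**
   * Returns how long to wait before the next fetch of a document.
   * An unchanged document waits twice as long as last time, capped at max.
   * Assumption: a changed document drops back to the minimum interval.
   */
  static Duration nextInterval(Duration current, boolean changed, Duration min, Duration max) {
    if (changed) {
      return min;
    }
    Duration doubled = current.multipliedBy(2);
    // Never exceed the maximum interval configured for the job.
    return doubled.compareTo(max) > 0 ? max : doubled;
  }
}
```

Under these assumptions, a document that never changes converges on the maximum interval after a few crawls, which would explain why rarely changing pages are revisited far less often than frequently changing ones.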

The best way to see what it is going to do is to find the document in the 
Document Status report, and see when ManifoldCF intends to recrawl it.





[jira] [Commented] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption

2019-04-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823915#comment-16823915
 ] 

Karl Wright commented on CONNECTORS-1600:
-

How many times did you have it retry?


> Add support for configuring JCIFS connector's resilience to SMB exceptions 
> before throwing a ServiceInterruption
> 
>
> Key: CONNECTORS-1600
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1600
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tang Huan Song
>Priority: Major
>
> This is an improvement request regarding the JCIFS(-ng) connector's exception 
> handling behavior.
> After examining the JCIFS connector code, I've found that the number of 
> retries given consecutive identical SMB exceptions and the total number of 
> retries per file/request are hardcoded within the connector at 
> retriesRemaining=3 and totalTries=5 respectively.
> Depending on the amount of traffic a file server regularly handles, the 
> probability of any given SMB request failing, and correspondingly the total 
> number of SMB request failures for a given file request, will vary. As a 
> result, the current hardcoded values may cause ManifoldCF to abandon the job 
> in the event of high traffic.
> I would like to suggest making these values configurable, either as a 
> connector-wide setting modified via ManifoldCF's properties.xml or as a 
> per-connection setting modified via the corresponding repository connection's 
> page.
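
For readers unfamiliar with the two limits named in the request above, the following sketch shows one way a consecutive-identical-failure limit and a total-attempt limit could interact. It is not the connector's actual code: the class name, the use of the exception message to decide whether two failures are "identical", and the exact counting are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry loop; the real JCIFS connector logic may differ. */
final class SmbRetrySketch {
  static final int CONSECUTIVE_IDENTICAL_LIMIT = 3; // "retriesRemaining" in the request
  static final int TOTAL_TRIES_LIMIT = 5;           // "totalTries" in the request

  private SmbRetrySketch() {}

  static <T> T callWithRetries(Callable<T> request) throws Exception {
    int retriesRemaining = CONSECUTIVE_IDENTICAL_LIMIT;
    int totalTries = TOTAL_TRIES_LIMIT;
    String lastMessage = null;
    while (true) {
      try {
        return request.call();
      } catch (Exception e) {
        // Assumption: two failures count as identical when their messages match.
        String message = String.valueOf(e.getMessage());
        if (message.equals(lastMessage)) {
          retriesRemaining--;
        } else {
          retriesRemaining = CONSECUTIVE_IDENTICAL_LIMIT;
          lastMessage = message;
        }
        totalTries--;
        // Give up when either budget is exhausted and surface the last failure.
        if (retriesRemaining <= 0 || totalTries <= 0) {
          throw e;
        }
      }
    }
  }
}
```

In a sketch like this, raising either constant only delays the point at which the failure is surfaced; it does nothing about whatever is causing the failures.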





[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector

2019-04-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823800#comment-16823800
 ] 

Karl Wright commented on CONNECTORS-1498:
-

Hi,
The following settings were suggested by Michael Allen, original developer of 
Jcifs:

{code}
jcifs.smb.client.soTimeout: 15
jcifs.smb.client.responseTimeout: 12
jcifs.resolveOrder: LMHOSTS,DNS,WINS
jcifs.smb.client.listCount: 20
jcifs.smb.client.dfs.strictView: true
{code}

The timeout values he chose were based on the behavior of the protocol, and 
were the maximum values he felt were reasonable given that.  The resolve order 
was chosen based on his sense of how fast each one of these is.  The listCount 
is also based on protocol considerations, and strictView is required for proper 
functioning of the connector.


> Support SMBv2/v3 protocol for Windows Shares connector
> --
>
> Key: CONNECTORS-1498
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1498
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: JCIFS connector
> Environment: OS: CentOS 7.2
> ManifoldCF: 2.8.1
>Reporter: Hiroaki Takasu
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.10
>
>
> The Windows Shares connector (JCIFS connector) uses the 
> [JCIFS|https://jcifs.samba.org/] library, which supports only SMB protocol 
> v1.
> But many file servers have SMBv1 disabled because of vulnerability 
> [MS17-010|https://docs.microsoft.com/en-us/security-updates/SecurityBulletins/2017/ms17-010],
>  so we cannot use the Windows Shares connector.
> I hope that ManifoldCF can support SMBv2/v3 with another CIFS library (e.g. 
> [smbj|https://github.com/hierynomus/smbj]).





[jira] [Resolved] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption

2019-04-23 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1600.
-
Resolution: Won't Fix

Not a good idea in my opinion



[jira] [Commented] (CONNECTORS-1600) Add support for configuring JCIFS connector's resilience to SMB exceptions before throwing a ServiceInterruption

2019-04-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823794#comment-16823794
 ] 

Karl Wright commented on CONNECTORS-1600:
-

This is, I would suggest, overkill.  Not only is it a bad idea to make 
everything possible configurable in the UI, but the benefit of doing so is 
small to non-existent.  In 10 years of supporting this connector, what I've 
learned is that whenever people say they just need more retries, something 
else is very wrong, and the added retries do not help.

If you ever run into a situation where 4 retries succeed and 3 don't, please 
let me know.




[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector

2019-04-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823745#comment-16823745
 ] 

Karl Wright commented on CONNECTORS-1498:
-

Some of these settings should not be changed from a specific value.  Other 
settings I could see changing on a per-connection basis.

jcifs-ng does give the ability to control settings per connection, unlike the 
original jcifs.  But I think that the ones that should be connection-based 
would be only:

- jcifs.resolveOrder
- jcifs.smb.client.minVersion
- jcifs.smb.client.maxVersion
- jcifs.smb.client.ipcSigningEnforced
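
One way to picture this split between fixed and per-connection settings is the sketch below. It uses only `java.util.Properties`; the class and method names are hypothetical, and how the merged properties would then be handed to jcifs-ng is deliberately left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/** Hypothetical helper: only the four listed keys may vary per connection. */
final class JcifsSettingsSketch {
  static final Set<String> PER_CONNECTION_KEYS =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
          "jcifs.resolveOrder",
          "jcifs.smb.client.minVersion",
          "jcifs.smb.client.maxVersion",
          "jcifs.smb.client.ipcSigningEnforced")));

  private JcifsSettingsSketch() {}

  /** Copies the fixed settings, then applies the allowed per-connection overrides. */
  static Properties forConnection(Properties fixed, Map<String, String> overrides) {
    Properties merged = new Properties();
    merged.putAll(fixed);
    for (Map.Entry<String, String> e : overrides.entrySet()) {
      if (!PER_CONNECTION_KEYS.contains(e.getKey())) {
        // Everything else stays at its fixed value.
        throw new IllegalArgumentException("Not a per-connection setting: " + e.getKey());
      }
      merged.setProperty(e.getKey(), e.getValue());
    }
    return merged;
  }
}
```

Rejecting unknown keys rather than silently ignoring them is a design choice of this sketch; it keeps the settings that must not change from being overridden by accident.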

The first thing to do is verify that the ported code works properly though.




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821823#comment-16821823
 ] 

Karl Wright commented on CONNECTORS-1449:
-

Well, this is the method you are overriding, and its signature looks different 
from the one you propose:

{code}
GetListItemsResponseGetListItemsResult items = stub1.getListItems(docLibrary, "", q, viewFields, "1", buildNonPagingQueryOptions(), null);
{code}

docLibrary is the document library GUID, q is a GetListItemsQuery, viewFields 
is a GetListItemsViewFields that describes which fields to return, and 
buildNonPagingQueryOptions() supplies the listing options.  I don't know what 
the last argument is.



> Add support for respecting the NoCrawl flag in Sharepoint
> -
>
> Key: CONNECTORS-1449
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1449
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: SharePoint connector
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Major
> Fix For: ManifoldCF next
>
>
> There is a flag {{NoCrawl}} in SharePoint that indicates whether an object 
> should be crawled or not:
> Lists
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.splist.nocrawl.aspx
> Web
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spweb.nocrawl.aspx
> Field
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spfield.nocrawl.aspx
> Wouldn't it be nice to respect that flag?





[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector

2019-04-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821570#comment-16821570
 ] 

Karl Wright commented on CONNECTORS-1498:
-

OK, I got past the build problems and made what I think are the necessary 
changes.

Please check out 
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1498 and try 
it.  Thanks!




[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector

2019-04-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821562#comment-16821562
 ] 

Karl Wright commented on CONNECTORS-1498:
-

[~hhoechtl], I just tried this.  The compilation errors are the following:

{code}
compile-connector:
[javac] C:\wip\mcf\trunk\dist\connector-build.xml:594: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 6 source files to C:\wip\mcf\trunk\connectors\jcifs\build\connector\classes
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:20: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
[javac] import jcifs.smb.NtlmPasswordAuthentication;
[javac] ^
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:35: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
[javac] import jcifs.smb.NtlmPasswordAuthentication;
[javac] ^
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:19: error: cannot find symbol
[javac] import jcifs.smb.ACE;
[javac] ^
[javac]   symbol:   class ACE
[javac]   location: package jcifs.smb
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:20: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
[javac] import jcifs.smb.NtlmPasswordAuthentication;
[javac] ^
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:1246: error: cannot find symbol
[javac]   protected void convertACEs(List allowList, List denyList, ACE[] aces)
[javac] ^
[javac]   symbol:   class ACE
[javac]   location: class SharedDriveConnector
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:2404: error: cannot find symbol
[javac]   protected static ACE[] getFileSecurity(SmbFile file, boolean useSIDs)
[javac] ^
[javac]   symbol:   class ACE
[javac]   location: class SharedDriveConnector
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveConnector.java:2442: error: cannot find symbol
[javac]   protected static ACE[] getFileShareSecurity(SmbFile file, boolean useSIDs)
[javac] ^
[javac]   symbol:   class ACE
[javac]   location: class SharedDriveConnector
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:34: error: cannot find symbol
[javac] import jcifs.smb.ACE;
[javac] ^
[javac]   symbol:   class ACE
[javac]   location: package jcifs.smb
[javac] C:\wip\mcf\trunk\connectors\jcifs\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\sharedrive\SharedDriveHelpers.java:35: warning: [deprecation] NtlmPasswordAuthentication in jcifs.smb has been deprecated
[javac] import jcifs.smb.NtlmPasswordAuthentication;
[javac] ^
[javac] 5 errors
[javac] 4 warnings
{code}

The thing that stops it from compiling is that there is no longer a class 
jcifs.smb.ACE.  This is an access-control list element.  It's obviously 
critical to the functioning of ManifoldCF to have that.

Can you research what happened to this in jcifs-ng?  Did they just rename it, 
or did they completely remove it?
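
A quick way to investigate a question like this without reading the library source is to probe the classpath by reflection. The helper below is a generic sketch with hypothetical names; it makes no claim about where jcifs-ng actually keeps its ACE type.

```java
/** Hypothetical classpath probe for checking whether a class was renamed or removed. */
final class ClassProbe {
  private ClassProbe() {}

  /** Returns true if the named class can be located without initializing it. */
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, ClassProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Returns the first candidate found on the classpath, or null if none is. */
  static String firstPresent(String... candidates) {
    for (String candidate : candidates) {
      if (isPresent(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}
```

With the new jar on the classpath, a call such as `ClassProbe.firstPresent("jcifs.smb.ACE", "jcifs.ACE")` would report which name, if either, resolves. The second candidate here is only an unverified guess at a possible new location.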




[jira] [Commented] (CONNECTORS-1498) Support SMBv2/v3 protocol for Windows Shares connector

2019-04-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821122#comment-16821122
 ] 

Karl Wright commented on CONNECTORS-1498:
-

[~hhoechtl], this looks like a reimplementation of JCIFS which in theory should 
support the same original JCIFS API.  It is LGPL, which is consistent with the 
original JCIFS, so I believe it includes much of the original code.

It might well work by just downloading the jar and treating it just like the 
original JCIFS.  Have you tried this?




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817880#comment-16817880
 ] 

Karl Wright commented on CONNECTORS-1449:
-

Yes, we could use the ListItems method to retrieve the NoCrawl flag for 
individual listed documents, provided this performs well.  If individual 
metadata requests would need to be made for each listed document, then we'd be 
better off adding a new method that wraps the metadata retrieval method we're 
currently using from Lists and adding the NoCrawl attribute to the response.

Does this make sense?



[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-14 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817462#comment-16817462
 ] 

Karl Wright commented on CONNECTORS-1449:
-

We cannot, and should not, change Lists.asmx -- it's a Microsoft service.
MCPermissions, though, wraps one of the methods in Lists.asmx and presents it 
as an MCPermissions method with fewer restrictions.

Please look at the code for details.




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816578#comment-16816578
 ] 

Karl Wright commented on CONNECTORS-1449:
-

Hi, the method that is used to fetch the metadata for a document over SOAP is 
the following:

{code}
metadataValues = proxy.getFieldValues( sortedMetadataFields, encodePath(sitePath), listID, "/Lists/" + decodedItemPath.substring(cutoff+1), dspStsWorks );
{code}

This calls:

{code}
  {
    // SharePoint 2010: Get field values some other way
    // Sharepoint 2010; use Lists service instead
    ListsWS lservice = new ListsWS(baseUrl + site, userName, password, configuration, httpClient );
    ListsSoapStub stub1 = (ListsSoapStub)lservice.getListsSoapHandler();

    String sitePlusDocId = serverLocation + site + docId;
    if (sitePlusDocId.startsWith("/"))
      sitePlusDocId = sitePlusDocId.substring(1);

    GetListItemsQuery q = buildMatchQuery("FileRef","Text",sitePlusDocId);
    GetListItemsViewFields viewFields = buildViewFields(fieldNames);

    GetListItemsResponseGetListItemsResult items = stub1.getListItems(docLibrary, "", q, viewFields, "1", buildNonPagingQueryOptions(), null);
    if (items == null)
      return result;

    MessageElement[] list = items.get_any();

    final String xmlResponse = list[0].toString();
    if (Logging.connectors.isDebugEnabled()){
      Logging.connectors.debug("SharePoint: getListItems FileRef value '"+sitePlusDocId+"', xml response: '" + xmlResponse + "'");
    }
{code}

So it is calling the Lists service to do this right now (SharePoint 2010 and 
higher).  For SharePoint 2003, it used the dspsts service, but that's been 
broken for a while, and I see no need to support this feature for that version 
of SharePoint.

If you introduce a new service or method, I will also need a configuration 
switch that enables the code that calls it, or backwards compatibility will not 
be maintained.






[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816164#comment-16816164
 ] 

Karl Wright commented on CONNECTORS-1449:
-

The MCPermissions plugin at present furnishes two services: one to get 
permissions for users, and the other to list documents without the restrictions 
imposed by SharePoint.  I would propose (if none of the dspsts, webs, or 
versions services handles this itself) that we either add a new 
MCPermissions service that wraps whatever is currently used to obtain document 
metadata with one that also adds the "NoCrawl" flag to the result, OR put it 
in the existing Lists service wrapper we currently have.

Note that the problem isn't going to be adequately addressed unless we can get 
this information on a per-document basis, somehow.  We need to be able to tell 
the framework to delete the document when the connector looks at it.  Doing 
this in a transformation connector won't work for that very same reason: the 
document won't be sent to the transformer unless it's noticed to have been 
changed in some way.  So the repository connector really has to handle this.
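
The reasoning above can be made concrete with a small decision function. This is a sketch only: the names are hypothetical, it does not use the ManifoldCF connector API, and it assumes the connector can see both the NoCrawl flag and a version string for each document.

```java
/** Hypothetical per-document decision made inside a repository connector. */
final class NoCrawlDecisionSketch {
  enum Action { DELETE, REINDEX, SKIP }

  private NoCrawlDecisionSketch() {}

  /**
   * The flag is checked before change detection, so a document that has not
   * otherwise changed is still removed once NoCrawl is set on it.
   */
  static Action decide(boolean noCrawl, String storedVersion, String currentVersion) {
    if (noCrawl) {
      return Action.DELETE;
    }
    if (currentVersion.equals(storedVersion)) {
      return Action.SKIP; // unchanged: nothing is sent downstream
    }
    return Action.REINDEX;
  }
}
```

A transformation connector would only ever be invoked for the `REINDEX` case, which is why a check placed there could not catch a document whose only change is the flag.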




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816017#comment-16816017
 ] 

Karl Wright commented on CONNECTORS-1449:
-

It depends on why you want to avoid crawling something.  If it's to prevent 
fetching it then you can't do it at the transformer level.

But the right solution is to look for it in the SOAP response.

There is another solution, which is to modify the ManifoldCF SharePoint plugin 
for SharePoint 2013 to return it from the modified Lists service.  That would 
involve C# code changes, but would definitely allow us access to the flag in 
the connector.  The code is checked in under 
https://svn.apache.org/repos/asf/manifoldcf/integration/sharepoint-2013/trunk . 
 Have a look.





[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815765#comment-16815765
 ] 

Karl Wright commented on CONNECTORS-1449:
-

[~durai-jira] Rewriting the SharePoint connector to use the REST API is well 
beyond the scope of a bug fix of this kind.

If you want to attempt this work and contribute it to ManifoldCF, I can 
certainly *help*, but I cannot do something of this scale myself, while working 
full time on other things, without even a SharePoint instance to work with.





[jira] [Resolved] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1599.
-
   Resolution: Not A Problem
 Assignee: Karl Wright
Fix Version/s: ManifoldCF 2.13

Working as designed.


> response code 401 still gets deleted with the setting "keep unreachable 
> documents"
> --
>
> Key: CONNECTORS-1599
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1599
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Even with the "Hop count mode" set to "keep unreachable documents, 'for now' 
> || forever" manifold deletes documents for which it receives a 401 response 
> code.
> The documentation does not specify such a distinction as described above. Is 
> there some information/configuration that I'm missing? Is there a reasoning 
> behind the guaranteed deletion of a 401?
> Ideally, for our use-case, we would want to remove all documents that return 
> 404, but keep everything which is due the server not responding or the 
> crawler being unauthenticated.
> Is there a way to configure this in a more granular fashion?
> Regards,
> roel





[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815389#comment-16815389
 ] 

Karl Wright commented on CONNECTORS-1599:
-

[~goovaertsr], 401 means the document is not accessible.  This has nothing to 
do with being "unreachable", because "unreachable" means there is no path to it 
from the seeds.
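
To illustrate the distinction, here is a minimal sketch (not ManifoldCF code) 
that separates the two ideas.  The link graph, the URLs, and the status codes 
are all made up for the example: "unreachable" is computed purely from links 
starting at the seeds, while "inaccessible" comes from the fetch result.

```python
from collections import deque

def reachable_from_seeds(links, seeds):
    """Return every URL that has a link path from at least one seed."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph and fetch results, for illustration only.
links = {
    "/": ["/public", "/private"],
    "/public": [],
    "/private": [],
    "/orphan": [],  # no page links here any more
}
status = {"/": 200, "/public": 200, "/private": 401, "/orphan": 200}

reachable = reachable_from_seeds(links, ["/"])
unreachable = set(links) - reachable                      # no path from seeds
inaccessible = {u for u, code in status.items() if code == 401}
```

In this toy data "/private" is reachable but inaccessible, and "/orphan" is 
accessible but unreachable, so a setting that governs unreachable documents 
says nothing about the 401 case.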




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815388#comment-16815388
 ] 

Karl Wright commented on CONNECTORS-1449:
-

[~durai-jira], I do not have a SharePoint instance here.  Nor do I have any 
indication that any of the web services included with SharePoint provide the 
value for this flag.  If you can show me how the standard services return the 
flag value, I can easily implement this.  Please let me know.




[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814326#comment-16814326
 ] 

Karl Wright commented on CONNECTORS-1592:
-

[~goovaertsr] Yes, if you have no intention of doing hopcount filtering ever, 
then disable hop count filtering forever.  It's far easier on the database.

Having said that, I'm pretty sure you have other problems too.


> Found long running query in manifold scheduled job
> --
>
> Key: CONNECTORS-1592
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1592
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.12
>Reporter: Subasini Rath
>Priority: Major
> Attachments: LongRunningWithPlan_thread39.txt, 
> SELECT_blocked_queries.txt, postgresql.conf, properties.xml
>
>
> Hi Karl,
>    I am also facing the above mentioned issue. (Similar to Connector-880)
> I am using manifold2.12 binary version. I am using Solr output connector and 
> Web repository connection. Manifold is using all default configuration.
> When I am running the jobs manually, it runs fine. Same jobs have been 
> scheduled to run everyday.
> I am getting below exceptions and the job gets hanged/ going to waiting stage.
> Could you please help me in resolving the same.
> I am getting the below error -
> Scenario-1
> WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query 
> (2706114 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a 
> long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query 
> (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
>  WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: 
> 'e'
>  WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running 
> query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE 
> needpriority=? LIMIT 1000]
>  WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i'
>  WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062'
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a 
> long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter 
> 0: 'S'
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Found a long-running query 
> (2752034 ms): [SELECT id FROM jobs WHERE status IN (?,?,?) FOR UPDATE]
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Parameter 0: 'A'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 1: 'W'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 2: 'R'
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Found a long-running 
> query (2752036 ms): [SELECT id FROM jobs WHERE status=? FOR UPDATE]
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Parameter 0: 'E'
>  WARN 2019-03-08T23:58:20,483 (qtp550147359-4339) - Found a long-running 
> query (2496641 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
> isDistinctSelect=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isGrouped=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isAggregated=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: columns=[ COLUMN: 
> PUBLIC.JOBS.ID not nullable
>  WARN 2019-03-08T23:58:20,492 (qtp550147359-4346) - Found a long-running 
> query (2435908 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: ]
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: [range variable 1
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: join type=INNER
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: table=SYSTEM_SUBQUERY
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: cardinality=0
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: access=FULL SCAN
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join condition = 
> [index=SYS_IDX_13329
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ]
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ][range variable 2
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814310#comment-16814310
 ] 

Karl Wright commented on CONNECTORS-1593:
-

Hi [~DonaldVdD], what connector is being used to download the files?  What is 
serving them?  Having the data get corrupted is very very odd; I can't imagine 
we have code that does that accidentally.


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814306#comment-16814306
 ] 

Karl Wright commented on CONNECTORS-1592:
-

{quote}
the largest was 223673ms, the minimum time spent was 172416ms, the others are 
distributed between these extrema
{quote}

I saw a longer-running query than that in the log you posted, some 200ms.  
But the plan was fine.  Once again, locking would have been the only 
explanation.  But if you are seeing no queries running in less than 172416ms, 
then I think you may well have found your problem.  The lion's share of 
PostgreSQL queries should be executing in well under a second; times around 
20ms would be typical.  Given that, something is very wrong with your 
PostgreSQL configuration or installation.
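
One way to get an overview of the durations being reported is to pull them 
out of the ManifoldCF log.  The sketch below assumes the "Found a 
long-running query (N ms)" message format seen in the log excerpts on this 
ticket; the function name and the sample text are only illustrative.  Note 
that these warnings cover only queries that were flagged as long-running, so 
they cannot show the fast queries.

```python
import re

# Matches the duration in "Found a long-running query (2737370 ms)".
# \s+ tolerates the message being wrapped across lines.
LONG_QUERY = re.compile(r"Found a long-running\s+query\s+\((\d+) ms\)")

def long_query_durations(log_text):
    """Return the reported durations, in milliseconds, in log order."""
    return [int(ms) for ms in LONG_QUERY.findall(log_text)]

sample = """\
WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
"""

durations = long_query_durations(sample)
print("shortest:", min(durations), "ms; longest:", max(durations), "ms")
```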

{quote}
Just one more question, considering what you said of the hopcount filtering; In 
the "Hop Filters"-tab we have nothing of configuration except for "hop count 
mode" is set to "delete unreachable", which i had interpreted as being the 
default. Is this correct that it is the default, and is there something else we 
could do to disable hop count filtering?
{quote}

That is the default; it's also the most inefficient.  From the manual:

{quote}
On this same tab, you can tell the Framework what to do should there be changes 
in the distance from the root to a document. The choice "Delete unreachable 
documents" requires the Framework to recalculate the distance to every 
potentially affected document whenever a change takes place. This may require 
expensive bookkeeping, however, so you also have the option of ignoring such 
changes. There are two varieties of this latter option - you can ignore the 
changes for now, with the option of turning back on the aggressive bookkeeping 
at a later time, or you can decide not to ever allow changes to propagate, in 
which case the Framework will discard the necessary bookkeeping information 
permanently. This last option is the most efficient.
{quote}
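
To see why "Delete unreachable documents" is expensive, consider a small 
sketch of hop counting.  This is not the Framework's actual bookkeeping 
(which is incremental and lives in the database); it is a plain breadth-first 
search over a made-up link graph, recomputed from scratch, just to show how 
one removed link can change the distance of every document behind it.

```python
from collections import deque

def hop_counts(links, seeds):
    """Minimum number of links from any seed to each reachable document."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in dist:
                dist[target] = dist[doc] + 1
                queue.append(target)
    return dist

# Hypothetical site: the seed links straight to "c", and also via "a" -> "b".
links = {"seed": ["a", "c"], "a": ["b"], "b": ["c"], "c": ["d"]}
before = hop_counts(links, ["seed"])

# The seed page is edited and no longer links to "c".
after_links = dict(links, seed=["a"])
after = hop_counts(after_links, ["seed"])

changed = sorted(doc for doc in before if before[doc] != after.get(doc))
```

Here a single edit to the seed page changes the hop count of both "c" and 
"d".  With a maximum hop count configured, such a change can push documents 
out of range, which is why the most aggressive mode has to track it.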



[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813400#comment-16813400
 ] 

Karl Wright commented on CONNECTORS-1592:
-

{quote}
one 'root' query doesn't get committed, this keeps a lock on the job-, 
intrinsiclink- or jobQueue-table and cascades into the bulk of locked queries. 
Main question here is how one query could get stuck; can a query be waiting for 
something from manifold until it is committed?
{quote}

That is certainly possible, but you should see that query logged as a 
very-long-running query in that case.  What is the longest-running query you 
see logged?

{quote}
there is a locking conflict that arises from the jobID being a foreign key 
constraint for both the jobQueue and intrinsiclinks. From debugging we have the 
impression that postgres locks the whole intrinsiclink-table in a query which 
is specified to have one specific jobId.
{quote}

It may do that, but in that case PostgreSQL throws a SQL exception, and the 
ManifoldCF code will retry the query.  So this situation cannot cause the 
symptom you are seeing.
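
The general retry-on-exception pattern looks roughly like the sketch below.  
This is not the ManifoldCF implementation: the exception class, the function 
names, and the backoff values are all hypothetical, and the "query" is a stub 
that fails twice before succeeding.

```python
import time

class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a deadlock or serialization failure."""

def run_with_retry(operation, max_attempts=5, base_delay=0.01):
    """Run operation(), retrying when the database reports a transient error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # give up and surface the error
            time.sleep(base_delay * attempt)  # brief, growing pause

attempts = []

def flaky_query():
    # Stub query: the first two calls hit a conflict, the third succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientDatabaseError("deadlock detected")
    return "ok"

result = run_with_retry(flaky_query)
```

The point is that a conflict handled this way ends quickly in either a 
success or an error; it does not leave a query waiting for tens of minutes.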

The *only* way you can get into this situation is to have one particular query, 
which hits tables that all the other queries depend on, take a very long time.  
And that should show in the log.
If it doesn't show in the log, that means that the locks are being caused 
externally, which is why I pointed at VACUUM FULL as being a potential cause.

{quote}
could using the multi process-functionality of 
org.apache.manifoldcf.usejettyparentclassloader be used to improve this issue?
{quote}

No, won't help.

{quote}
I have read that disabling swap can be good for intensive db-interactions; do 
you have experience with disabling swap improving manifold?
{quote}

Once again, probably immaterial, EXCEPT if your postgresql instances are 
swapping.  That would be bad.

{quote}
is there a possibility that we could set-up a conference call with someone from 
the manifold team?
{quote}

I work full time on an entirely unrelated task and probably there is nobody 
else who would be in a position to go deep on this issue.  So this is unlikely.

One thing I notice, though, is that you are seeing a lot of intrinsiclink 
activity.  If you are not using hopcount filtering, you can disable that 
entirely at the job level.  It might help you (can't be sure until the blocking 
culprit is found though).



[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-04 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809709#comment-16809709
 ] 

Karl Wright commented on CONNECTORS-1592:
-

[~goovaertsr], this is a perfectly fine plan, as you see in the execution 
estimates here:

{code}
WARN 2019-04-03T14:09:04,328 (Worker thread '39') -  Plan: Planning time: 0.706 
ms
 WARN 2019-04-03T14:09:04,328 (Worker thread '39') -  Plan: Execution time: 
0.382 ms
{code}

And yet the time (in this case) is 2 seconds for execution, which is still not 
bad actually, given that MCF is pounding on the database.

As I said before, there is no indication of actual bad plans.  Instead, the 
database as a whole is going offline or is being locked down for an extended 
period of time.




[jira] [Commented] (CONNECTORS-1598) session based authentication cannot register 401

2019-04-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808808#comment-16808808
 ] 

Karl Wright commented on CONNECTORS-1598:
-

Hi [~goovaertsr], if your site authentication involves cookies, you really need 
to have session-based authentication set up.  Furthermore, you do NOT want to 
include the login page in your seed list.  Instead, you want to set up a login 
sequence (which is attached to a specific set of URLs that define the 
session-protected part of your site), which will be triggered if the login 
needs to be done.

What session-based auth does is the following:
- detects when accessing a content page fails because the session login is 
missing
- provides a way of walking through the session login page sequence that sets 
the cookies
- retries the content page fetch with the cookies obtained from that login
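
The three steps in the list above can be sketched roughly as follows. This is 
only an illustrative outline in Python, not ManifoldCF's actual implementation: 
`fetch_with_session`, `fetch`, `needs_login`, `run_login_sequence`, the cookie 
name `SESSION` and the `example.invalid` URL are all hypothetical names made up 
for the example.

```python
from typing import Callable, Dict, Tuple

Cookies = Dict[str, str]
Response = Tuple[int, str]  # (HTTP status code, response body)


def fetch_with_session(
    url: str,
    cookies: Cookies,
    fetch: Callable[[str, Cookies], Response],
    needs_login: Callable[[int, str], bool],
    run_login_sequence: Callable[[], Cookies],
    max_logins: int = 1,
) -> Response:
    """Fetch url, logging in and retrying when the session is missing."""
    status, body = fetch(url, cookies)
    logins = 0
    # Step 1: detect that the fetch failed because of a missing session.
    while needs_login(status, body) and logins < max_logins:
        # Step 2: walk the login sequence and keep the cookies it sets.
        cookies.update(run_login_sequence())
        logins += 1
        # Step 3: retry the content fetch with the fresh cookies.
        status, body = fetch(url, cookies)
    return status, body


# Stand-in site: answers 401 plus a login form until a session cookie is set.
def fake_fetch(url: str, cookies: Cookies) -> Response:
    if cookies.get("SESSION") == "valid":
        return 200, "content of " + url
    return 401, '<form name="loginform"></form>'


def fake_needs_login(status: int, body: str) -> bool:
    # Match on page content (the login form), not only on the status code.
    return 'name="loginform"' in body


def fake_login() -> Cookies:
    return {"SESSION": "valid"}


jar: Cookies = {}
result = fetch_with_session(
    "https://example.invalid/sitemap.xml", jar, fake_fetch, fake_needs_login, fake_login
)
print(result[0], jar)
```

The point of the sketch is that the login is triggered by what the content 
fetch returns, so the login page never has to be a seed.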

It is therefore critical to configure the session-based access so that it 
properly recognizes when the site you are crawling has rejected an invalid, 
missing, or expired session cookie.  If you've already read the end-user 
documentation about this, then this should be clear.

If I understand your problem, your site does not redirect to a login page when 
there is no session cookie: it simply returns a 401.  That's not a very typical 
flow for session-based code, but you should nevertheless be able to match 
specific page contents associated with the 401 response.  In HTTP, most response 
codes can carry content, and 401 is no different, so I assume this is the case?

To answer your question:
{quote}
is 4xx tied to page base authentication and 3xx tied to session based 
authentication?
{quote}

4xx responses are handled only as page-based authentication.  That is what 4xx 
responses typically mean in HTTP.


> session based authentication cannot register 401
> 
>
> Key: CONNECTORS-1598
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Major
>
> Description:
> Access to a specific domain is restricted by being A) an intranet service B) 
> based on an employee/customer profile.
> For manifold to be able to be authenticated there is a specific 
> '\{{domain}}/login' page with a form where manifold was configured to enter 
> its username and password. A session-cookie is then set so manifold is 
> authenticated to access all resources. If a request for a resource is not 
> authenticated the service throws a 401. When the service returns a 401 the 
> actual content of the resource includes the same form as is present in 
> '\{{domain}}/login'.
> Problem:
> The only way we have been able to configure manifold to be authenticated was 
> by specifying session-based credentials AND providing '\{{domain}}/login' as 
> a seed in the job as well. The only other seed in the job is a sitemap.
> This is of course not ideal since it can easily happen that the seed for the 
> sitemap gets processed first, which then throws a 401 on the sitemap and the 
> job stops.
> Another possible scenario with this configuration is that the cookie expires 
> and all other resources throw 401 and get deleted from the index 
> (elasticsearch). There is also another job (different language, same domain), 
> usage of the cookie from the previous job has also been registered.
> Current session-based access credentials configuration:
> --url regular expression : https://\{{domain}}/
> --login pages: 
> ---login url regexp : 'login'
> ---page type : form
> ---identification regexp is set to match the form-name
> ---form parameters are filled with the correct parameters
> This is verified to work, but as I understand it, this only works because the 
> login-page is part of the seeds and so it matches the url when it comes 
> across it when crawling. There is no configuration yet which redirects (for 
> example) to this page when manifold receives a 401.
> My goal was then to remove the login-page from the seeds and configure the 
> job so that each time a fetch returns a 401, manifold knows to go to the 
> login page. in pseudo code:
> --If authenticated
> ---process 
> --else
> ---redirect to login
> ---retry resource
>  
> Based on the documentation here: 
> https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
>  I tried a few different configurations. The first thing to notice is in the 
> comparison table, 'page based authentication' only mentions 4xx and 'session 
> based authentication' only mentions 3xx.
> At this time my biggest question is; are these response codes bound to the 
> difference in settings between page and session based? As far I have been 
> able 

[jira] [Comment Edited] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808652#comment-16808652
 ] 

Karl Wright edited comment on CONNECTORS-1592 at 4/3/19 12:08 PM:
--

[~goovaertsr], when a large number of queries are blocked and do not get 
executed for a while (in this case 2706114 ms or so), then when all of them 
finally fire they are all reported as "slow running queries".  The question is: 
why are all of these queries blocked?

Tuple bloat just makes the database generally get slower and slower, so that is 
not it.

If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, 
since tables get completely locked one at a time and are recreated.  It is 
recommended that you either shut ManifoldCF down during this time, or create a 
"signalling file" which tells ManifoldCF to not do any real work until it goes 
away.  Your choice.  If you want to know more about the latter option, please 
let me know.

If this isn't due to a concurrent "VACUUM FULL", then we're left with finding 
some other cause.  While the blockage is taking place, there may be a way of 
getting PostgreSQL's state across all requests; that would be the ideal way to 
figure it out.

FWIW, the following query should be instantaneous, which is why it appears to 
me that the whole database is locked down somehow:

{code}
WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query 
(2706114 ms): [SELECT 
t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
t0 ORDER BY description ASC]
{code}
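
One way to check whether the reported queries were all blocked together is to 
subtract each reported duration from the timestamp of its log line; if the 
resulting start times cluster, that points at a single blocking event rather 
than many independently slow queries. The following is a rough Python sketch 
under the assumption that each warning sits on one line in the real log (the 
lines quoted in this issue are wrapped by the mailer). The regular expression 
and the function name `query_start_times` are my own, not part of ManifoldCF.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "WARN <timestamp> (<thread>) - Found a long-running query (<n> ms): ..."
LINE_RE = re.compile(
    r"WARN (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}) \((.+?)\) - "
    r"Found a long-running query \((\d+) ms\)"
)


def query_start_times(log_text):
    """Return (thread name, estimated start time) for each reported query."""
    starts = []
    for match in LINE_RE.finditer(log_text):
        logged_at = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
        duration = timedelta(milliseconds=int(match.group(3)))
        starts.append((match.group(2), logged_at - duration))
    return starts


# Two warnings from this issue, re-joined onto single lines.
SAMPLE = (
    "WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query "
    "(2706114 ms): [SELECT t0.id,t0.description,t0.status,t0.starttime,t0.endtime,"
    "t0.errortext FROM jobs t0 ORDER BY description ASC]\n"
    "WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a "
    "long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]\n"
)

starts = query_start_times(SAMPLE)
for thread, started in starts:
    print(thread, started.isoformat())
```

Running this over the full log and sorting by start time shows when the 
blockage began, which can then be compared against maintenance schedules.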



was (Author: kwri...@metacarta.com):
[~goovaertsr], when a large number of queries are blocked and do not get 
executed for a while (in this case 132000ms or so), then when all of them 
finally fire they are all reported as "slow running queries".  The question is: 
why are all of these queries blocked?

Tuple bloat just makes the database generally get slower and slower, so that is 
not it.

If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, 
since tables get completely locked one at a time and are recreated.  It is 
recommended that you either shut ManifoldCF down during this time, or create a 
"signalling file" which tells ManifoldCF to not do any real work until it goes 
away.  Your choice.  If you want to know more about the latter option, please 
let me know.

If this isn't due to a concurrent "VACUUM FULL", then we're left with finding 
some other cause.  While the blockage is taking place, there may be a way of 
getting PostgreSQL's state across all requests; that would be the ideal way to 
figure it out.


> Found long running query in manifold scheduled job
> --
>
> Key: CONNECTORS-1592
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1592
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.12
>Reporter: Subasini Rath
>Priority: Major
>
> Hi Karl,
>    I am also facing the above mentioned issue. (Similar to Connector-880)
> I am using manifold2.12 binary version. I am using Solr output connector and 
> Web repository connection. Manifold is using all default configuration.
> When I run the jobs manually, they run fine. The same jobs have been 
> scheduled to run every day.
> I am getting the exceptions below, and the job hangs or goes into a waiting 
> state.
> Could you please help me resolve this?
> I am getting the below error -
> Scenario-1
> WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query 
> (2706114 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a 
> long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query 
> (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
>  WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: 
> 'e'
>  WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running 
> query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE 
> needpriority=? LIMIT 1000]
>  WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i'
>  WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062'
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a 
> long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter 
> 0: 'S'
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) 

[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808652#comment-16808652
 ] 

Karl Wright commented on CONNECTORS-1592:
-

[~goovaertsr], when a large number of queries are blocked and do not get 
executed for a while (in this case 132000ms or so), then when all of them 
finally fire they are all reported as "slow running queries".  The question is: 
why are all of these queries blocked?

Tuple bloat just makes the database generally get slower and slower, so that is 
not it.

If you execute "VACUUM FULL" while ManifoldCF is running, that *could* do it, 
since tables get completely locked one at a time and are recreated.  It is 
recommended that you either shut ManifoldCF down during this time, or create a 
"signalling file" which tells ManifoldCF to not do any real work until it goes 
away.  Your choice.  If you want to know more about the latter option, please 
let me know.

If this isn't due to a concurrent "VACUUM FULL", then we're left with finding 
some other cause.  While the blockage is taking place, there may be a way of 
getting PostgreSQL's state across all requests; that would be the ideal way to 
figure it out.


> Found long running query in manifold scheduled job
> --
>
> Key: CONNECTORS-1592
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1592
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.12
>Reporter: Subasini Rath
>Priority: Major
>
> Hi Karl,
>    I am also facing the above mentioned issue. (Similar to Connector-880)
> I am using manifold2.12 binary version. I am using Solr output connector and 
> Web repository connection. Manifold is using all default configuration.
> When I run the jobs manually, they run fine. The same jobs have been 
> scheduled to run every day.
> I am getting the exceptions below, and the job hangs or goes into a waiting 
> state.
> Could you please help me resolve this?
> I am getting the below error -
> Scenario-1
> WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query 
> (2706114 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a 
> long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query 
> (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
>  WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: 
> 'e'
>  WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running 
> query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE 
> needpriority=? LIMIT 1000]
>  WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i'
>  WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062'
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a 
> long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter 
> 0: 'S'
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Found a long-running query 
> (2752034 ms): [SELECT id FROM jobs WHERE status IN (?,?,?) FOR UPDATE]
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Parameter 0: 'A'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 1: 'W'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 2: 'R'
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Found a long-running 
> query (2752036 ms): [SELECT id FROM jobs WHERE status=? FOR UPDATE]
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Parameter 0: 'E'
>  WARN 2019-03-08T23:58:20,483 (qtp550147359-4339) - Found a long-running 
> query (2496641 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
> isDistinctSelect=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isGrouped=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isAggregated=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: columns=[ COLUMN: 
> PUBLIC.JOBS.ID not nullable
>  WARN 2019-03-08T23:58:20,492 (qtp550147359-4346) - Found a long-running 
> query (2435908 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: ]
>  WARN 

[jira] [Assigned] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1595:
---

Assignee: Kishore Kumar

> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Kishore Kumar
>Priority: Minor
>
> Below is the full analysis and description as a result from the penetration 
> test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF).
> A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing 
> to the attacked application.
> The image source would actually be a URL with parameters pointing to the 
> application page that
> performs some action. In case of a form, the form action would point to the 
> action page in the target
> application, and the form is submitted automatically by JavaScript when the 
> page is viewed.
> 2. The attacker tricks the victim user to browse to this page. The attacker 
> may get the victim to click a
> link, or embed the attacking HTML code into some page the victim views, for 
> example in a bulletin
> board or chat.
> 3. When the victim views the attacker's page, his browser sends a request 
> prepared by the attacker to
> the attacked application. If the victim is logged in to the target 
> application, his browser will possess
> all necessary session tokens, so the request will appear as authorized to the 
> application and
> succeed.
> A cross-site request forgery attack uses the fact that the victim's browser 
> possesses the necessary
> authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker that can trick an authenticated user into 
> clicking a link crafted by the
> attacker or open a malicious web page, can force the victim to unknowingly 
> perform various actions within
> the application.
> Given that the whole application is not protected against CSRF, any action 
> that an administrator can take on
> Apache Manifold could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. 
> Consider the following example. An
> attacker tricks a logged in application user to visit a page containing the 
> following code:
> {code:java}
> 
> 
> 
> history.pushState('', '', '/')
> https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp;
> method="POST" enctype="multipart/form-data">
> 
> 
> 
> 
> 
> 
>  value="orgapachemanifoldcfcrawlerconnectorswebcrawlerWebcr
> awlerConnector" />
> 
> 
> 
>  value="ferdiklompcraftworkznl" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  value="httpsintrauatwebbc" />
>  />
> 
>  value="validation" />
>  value=""
> />
>  value="Continue" />
>  value="username" />
>  value="id996812" />
>  value="" />
>  value="Continue" />
>  value="password" />
>  value="Th1sIs4cl1X" />
>  value="" />
>  value="Continue" />
>  value="loginformtype" />
>  value="pwd" />
>  value="" />
>  value="3" />
> 
> 
>  value="httpsintrauatwebbc" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> {code}
> When the victim's browser parses the page and tries to load images, it will 
> cause them to execute any action
> of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with 
> an unpredictable value to
> each form or link that performs some action in the application, commonly 
> referred to as a CSRF-Token. The
> parameter value should have enough entropy so that it cannot be predicted by 
> an attacker and should be
> unique to the current user session. When the user submits the form or clicks 
> the link, the server side code
> checks the parameter value. If it is valid, the request is accepted, 
> otherwise it is denied. The attacker has no
> way of knowing the value of the unpredictable parameter, so he cannot 
> construct a form or link that will
> submit a valid request.
> *References*
>  * OWASP - Cross-Site Request Forgery - 
> [https://www.owasp.org/index.php/Cross-]
> Site_Request_Forgery
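
The CSRF-token recommendation quoted above can be illustrated with a small 
sketch. It is written in Python for brevity and is not taken from the 
ManifoldCF UI code; the function names and the plain dict standing in for the 
server-side session are assumptions made for the example.

```python
import secrets


def issue_csrf_token(session):
    """Create an unpredictable token and bind it to the user's session."""
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe text
    session["csrf_token"] = token
    return token  # to be embedded in each form as a hidden parameter


def is_valid_csrf(session, submitted):
    """Accept the request only if the submitted token matches the session's."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # Constant-time comparison; encode first so non-ASCII input cannot raise.
    return secrets.compare_digest(expected.encode("utf-8"), submitted.encode("utf-8"))


session = {}  # stand-in for a server-side session object
token = issue_csrf_token(session)
print(is_valid_csrf(session, token))
print(is_valid_csrf(session, "attacker-guess"))
```

A forged form submitted from another site carries the victim's session cookie 
but not the token, so the server-side check rejects it.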





[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803842#comment-16803842
 ] 

Karl Wright commented on CONNECTORS-1595:
-

[~goovaertsr] I am going to assign these to the fellow who wrote the current UI 
and see what he says.  I expect some things would be easier to address than 
others.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> Below is the full analysis and description as a result from the penetration 
> test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF).
> A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing 
> to the attacked application.
> The image source would actually be a URL with parameters pointing to the 
> application page that
> performs some action. In case of a form, the form action would point to the 
> action page in the target
> application, and the form is submitted automatically by JavaScript when the 
> page is viewed.
> 2. The attacker tricks the victim user to browse to this page. The attacker 
> may get the victim to click a
> link, or embed the attacking HTML code into some page the victim views, for 
> example in a bulletin
> board or chat.
> 3. When the victim views the attacker's page, his browser sends a request 
> prepared by the attacker to
> the attacked application. If the victim is logged in to the target 
> application, his browser will possess
> all necessary session tokens, so the request will appear as authorized to the 
> application and
> succeed.
> A cross-site request forgery attack uses the fact that the victim's browser 
> possesses the necessary
> authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker that can trick an authenticated user into 
> clicking a link crafted by the
> attacker or open a malicious web page, can force the victim to unknowingly 
> perform various actions within
> the application.
> Given that the whole application is not protected against CSRF, any action 
> that an administrator can take on
> Apache Manifold could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. 
> Consider the following example. An
> attacker tricks a logged in application user to visit a page containing the 
> following code:
> {code:java}
> 
> 
> 
> history.pushState('', '', '/')
> https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp;
> method="POST" enctype="multipart/form-data">
> 
> 
> 
> 
> 
> 
>  value="orgapachemanifoldcfcrawlerconnectorswebcrawlerWebcr
> awlerConnector" />
> 
> 
> 
>  value="ferdiklompcraftworkznl" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  value="httpsintrauatwebbc" />
>  />
> 
>  value="validation" />
>  value=""
> />
>  value="Continue" />
>  value="username" />
>  value="id996812" />
>  value="" />
>  value="Continue" />
>  value="password" />
>  value="Th1sIs4cl1X" />
>  value="" />
>  value="Continue" />
>  value="loginformtype" />
>  value="pwd" />
>  value="" />
>  value="3" />
> 
> 
>  value="httpsintrauatwebbc" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> {code}
> When the victim's browser parses the page and tries to load images, it will 
> cause them to execute any action
> of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with 
> an unpredictable value to
> each form or link that performs some action in the application, commonly 
> referred to as a CSRF-Token. The
> parameter value should have enough entropy so that it cannot be predicted by 
> an attacker and should be
> unique to the current user session. When the user submits the form or clicks 
> the link, the server side code
> checks the parameter value. If it is valid, the request is accepted, 
> otherwise it is denied. The attacker has no
> way of knowing the value of the unpredictable parameter, so he cannot 
> construct a form or link that will
> submit a valid request.
> *References*
>  * OWASP - Cross-Site Request Forgery - 
> [https://www.owasp.org/index.php/Cross-]
> Site_Request_Forgery





[jira] [Assigned] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1597:
---

Assignee: Kishore Kumar  (was: Karl Wright)

> reflected cross-site scripting vulnerability
> 
>
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Kishore Kumar
>Priority: Minor
>
> This is the full report of a penetration test, performed at a client where we 
> deployed a system which uses manifold:
> *Summary*
> A reflected cross-site scripting vulnerability was discovered in the 
> application.
> Reflected cross-site scripting occurs when a web application displays data 
> submitted by the user that
> contains HTML markup and scripting code without properly escaping it. An 
> attacker will create a link to the
> vulnerable page that will display JavaScript code crafted by the attacker. The 
> attacker will then trick an
> authenticated application user into clicking or following this crafted link. 
> When the user's browser parses the
> generated page, it will execute the code crafted by the attacker. If the user 
> was logged in to the application
> when he followed the link, the attacker's code could perform any action in 
> the application that the user can
> perform.
> *Impact*
> Reflected cross-site scripting can be used by attackers to compromise the 
> session of an authenticated user.
> By persuading the victim to click on a specially crafted link, the attacker 
> can execute his own JavaScript
> payload in the browser context of the victim. In this specific case, an 
> attacker could hijack its victim's session
> given that the session token is not flagged as HttpOnly as demonstrated in 
> [G190204T1F4][MANIFOLD]
> Insecure Cookie Configuration.
> Additional attacks exist where an attacker can deceive end users of the 
> application by redirecting them to
> replica sites or trick them into downloading trojans or other malware. The 
> attacker can also use a so called
> browser exploitation framework. In this scenario the attacker injects 
> JavaScript code that communicates to
> the attack framework running on the attacker's computer. When the victim user 
> executes the JavaScript code
> the attacker can control the victim's browser. Publicly available frameworks 
> exist (BeEF -
> [http://www.bindshell.net/tools/beef], Backframe 
> -[http://www.gnucitizen.org/projects/backframe/], XSS Proxy -
> [http://xss-proxy.sourceforge.net/]).
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] [name of an arbitrarily 
> supplied URL parameter]
> *Description*
> A case where the application includes user input into the generated HTML 
> pages without properly escaping
> the user supplied data was discovered in the application. The HTTP requests 
> and responses shown below
> demonstrate the problem.
> {code:java}
> GET /mcf-crawler-ui/?smafi">alert(1)non7x=1 HTTP/1.1
> Host: els-manifold-uat.bc:8475
> Accept-Encoding: gzip, deflate
> Accept: */*
> Accept-Language: en
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; 
> Trident/5.0)
> Connection: close
> Cookie: JSESSIONID=ov3qae9biucxdat0xiin5s18
> {code}
> {code:java}
> HTTP/1.1 200 OK
> Server: nginx/1.12.2
> Date: Mon, 18 Feb 2019 13:07:02 GMT
> Content-Type: text/html;charset=utf-8
> Content-Length: 2576
> Connection: close
> Pragma: No-cache
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> max-age: Thu, 01 Jan 1970 00:00:00 GMT
> 
> 
> 
> http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
>  type="text/css"/>
> 
> Apache ManifoldCF™ Login
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Sign in to start your session
>  method="POST">
> alert(1)non7x=1">
> 
> --snip--
> {code}
> *Recommendations*
> We recommend that the application enforces proper validation on user input. 
> In most situations where user-controllable
> data is copied into application responses, cross-site scripting attacks can 
> be prevented using two
> layers of defenses:
>  * Input should be validated as strictly as possible on arrival, given the 
> kind of content which it is
> expected to contain. For example, personal names should consist of 
> alphabetical and a small range
> of typographical characters, and be relatively short; a year of birth should 
> consist of exactly four
> numerals; email addresses should match a 

[jira] [Assigned] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1597:
---

Assignee: Karl Wright

> reflected cross-site scripting vulnerability
> 
>
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Karl Wright
>Priority: Minor
>
> This is the full report of a penetration test, performed at a client where we 
> deployed a system which uses manifold:
> *Summary*
> A reflected cross-site scripting vulnerability was discovered in the 
> application.
> Reflected cross-site scripting occurs when a web application displays data 
> submitted by the user that
> contains HTML markup and scripting code without properly escaping it. An 
> attacker will create a link to the
> vulnerable page that will display JavaScript code crafted by the attacker. The 
> attacker will then trick an
> authenticated application user into clicking or following this crafted link. 
> When the user's browser parses the
> generated page, it will execute the code crafted by the attacker. If the user 
> was logged in to the application
> when he followed the link, the attacker's code could perform any action in 
> the application that the user can
> perform.
> *Impact*
> Reflected cross-site scripting can be used by attackers to compromise the 
> session of an authenticated user.
> By persuading the victim to click on a specially crafted link, the attacker 
> can execute his own JavaScript
> payload in the browser context of the victim. In this specific case, an 
> attacker could hijack its victim's session
> given that the session token is not flagged as HttpOnly as demonstrated in 
> [G190204T1F4][MANIFOLD]
> Insecure Cookie Configuration.
> Additional attacks exist where an attacker can deceive end users of the 
> application by redirecting them to
> replica sites or trick them into downloading trojans or other malware. The 
> attacker can also use a so called
> browser exploitation framework. In this scenario the attacker injects 
> JavaScript code that communicates to
> the attack framework running on the attacker's computer. When the victim user 
> executes the JavaScript code
> the attacker can control the victim's browser. Publicly available frameworks 
> exist (BeEF -
> [http://www.bindshell.net/tools/beef], Backframe 
> -[http://www.gnucitizen.org/projects/backframe/], XSS Proxy -
> [http://xss-proxy.sourceforge.net/]).
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/] [name of an arbitrarily 
> supplied URL parameter]
> *Description*
> The application was found to include user input in the generated HTML 
> pages without properly escaping
> the user-supplied data. The HTTP request 
> and response shown below
> demonstrate the problem.
> {code:java}
> GET /mcf-crawler-ui/?smafi">alert(1)non7x=1 HTTP/1.1
> Host: els-manifold-uat.bc:8475
> Accept-Encoding: gzip, deflate
> Accept: */*
> Accept-Language: en
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; 
> Trident/5.0)
> Connection: close
> Cookie: JSESSIONID=ov3qae9biucxdat0xiin5s18
> {code}
> {code:java}
> HTTP/1.1 200 OK
> Server: nginx/1.12.2
> Date: Mon, 18 Feb 2019 13:07:02 GMT
> Content-Type: text/html;charset=utf-8
> Content-Length: 2576
> Connection: close
> Pragma: No-cache
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> max-age: Thu, 01 Jan 1970 00:00:00 GMT
> 
> 
> 
> http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
>  type="text/css"/>
> 
> Apache ManifoldCF™ Login
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Sign in to start your session
>  method="POST">
> alert(1)non7x=1">
> 
> --snip--
> {code}
> *Recommendations*
> We recommend that the application enforces proper validation on user input. 
> In most situations where user-controllable
> data is copied into application responses, cross-site scripting attacks can 
> be prevented using two
> layers of defenses:
>  * Input should be validated as strictly as possible on arrival, given the 
> kind of content which it is
> expected to contain. For example, personal names should consist of 
> alphabetical and a small range
> of typographical characters, and be relatively short; a year of birth should 
> consist of exactly four
> numerals; email addresses should match a well-defined regular 
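
Editorial note: the report above comes down to user input being copied into
the generated HTML without escaping. The following is a minimal sketch of
output escaping in Java. The class name HtmlEscaper and its method are
hypothetical and are not part of ManifoldCF; the sketch assumes the value is
written into HTML text or into a quoted attribute, and the payload in main is
an illustrative string, not the one from the report.

```java
/** Hypothetical helper (not ManifoldCF code): escapes a value for HTML text or a quoted attribute. */
final class HtmlEscaper {
    private HtmlEscaper() {}

    /** Replaces the five characters that can break out of HTML text or a quoted attribute. */
    static String escape(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative payload only (an assumption, not copied from the report).
        String reflected = "\"><script>alert(1)</script>";
        System.out.println(escape(reflected));
    }
}
```

Escaping at the point of output complements, rather than replaces, the input
validation that the report recommends.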

[jira] [Commented] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2019-03-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803835#comment-16803835
 ] 

Karl Wright commented on CONNECTORS-1597:
-

Hi [~goovaertsr], I see that all of the security tickets you have opened have 
to do with usage of the ManifoldCF UI in an open web environment.

Please understand that the UI was not designed for the kinds of security 
concerns one might have in such an environment.  

The team here is small, and UI design is not an area that has a deep bench.  I 
would therefore urge you to include patches to address the concerns you have, 
in the best tradition of open-source software.  Otherwise there is little 
chance they will be competently addressed.

Thanks in advance.

> reflected cross-site scripting vulnerability
> 
>
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>

[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803802#comment-16803802
 ] 

Karl Wright commented on CONNECTORS-1595:
-

[~goovaertsr]: For all of the security tickets you have submitted against MCF, 
we have no ability to address these ourselves; this is a small project and 
essentially you are attempting to make the MCF UI safe to operate in an open 
web environment.  That was not its design point, either at the beginning or 
ever.

We are always receptive to patches, so if you have specific code changes you 
want us to consider, please feel free to attach appropriate patches to the 
tickets you have created.

Thank you.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> Below is the full analysis and description resulting from the penetration 
> test.
> *Summary*
> The application is vulnerable to Cross-Site Request Forgery (CSRF).
> A cross-site request forgery attack uses the following scenario:
> 1. An attacker creates a web page that includes an image or a form pointing 
> to the attacked application.
> The image source would actually be a URL with parameters pointing to the 
> application page that
> performs some action. In case of a form, the form action would point to the 
> action page in the target
> application, and the form is submitted automatically by JavaScript when the 
> page is viewed.
> 2. The attacker tricks the victim into browsing to this page. The attacker 
> may get the victim to click a
> link, or embed the attacking HTML code into some page the victim views, for 
> example in a bulletin
> board or chat.
> 3. When the victim views the attacker's page, his browser sends a request 
> prepared by the attacker to
> the attacked application. If the victim is logged in to the target 
> application, his browser will possess
> all necessary session tokens, so the request will appear as authorized to the 
> application and
> succeed.
> A cross-site request forgery attack uses the fact that the victim's browser 
> possesses the necessary
> authentication tokens to perform some actions in the target application.
> *Impact*
> A remote, unauthenticated attacker who can trick an authenticated user into 
> clicking a link crafted by the
> attacker or opening a malicious web page can force the victim to unknowingly 
> perform various actions within
> the application.
> Given that the whole application is not protected against CSRF, any action 
> that an administrator can take on
> Apache Manifold could be unknowingly performed if they fall for a CSRF attack.
> *Affected Systems*
>  * [https://els-manifold-uat.bc:8475/mcf-crawler-ui/]
> *Description*
> It appears that the application does not implement any CSRF protection. 
> Consider the following example. An
> attacker tricks a logged-in application user into visiting a page containing the 
> following code:
> {code:java}
> 
> 
> 
> history.pushState('', '', '/')
> https://els-manifold-uat.bc:8475/mcf-crawler-ui/execute.jsp;
> method="POST" enctype="multipart/form-data">
> 
> 
> 
> 
> 
> 
>  value="orgapachemanifoldcfcrawlerconnectorswebcrawlerWebcr
> awlerConnector" />
> 
> 
> 
>  value="ferdiklompcraftworkznl" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  value="httpsintrauatwebbc" />
>  />
> 
>  value="validation" />
>  value=""
> />
>  value="Continue" />
>  value="username" />
>  value="id996812" />
>  value="" />
>  value="Continue" />
>  value="password" />
>  value="Th1sIs4cl1X" />
>  value="" />
>  value="Continue" />
>  value="loginformtype" />
>  value="pwd" />
>  value="" />
>  value="3" />
> 
> 
>  value="httpsintrauatwebbc" />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> {code}
> When the victim's browser parses the page and tries to load images, it will 
> cause them to execute any action
> of the attacker's choosing on Manifold.
> *Recommendations*
> The usual approach to preventing CSRF attacks is to add a new parameter with 
> an unpredictable value to
> each form or link that performs some action in the application, commonly 
> referred to as a CSRF-Token. The
> parameter value should have enough entropy so that it cannot be predicted by 
> an attacker and should be
> unique to the current user session. When the user submits the form or clicks 
> the link, the server-side code
> checks the parameter value. If it is valid, the request is accepted, 
> otherwise it is denied. The attacker has no
> way of knowing the value of the unpredictable parameter, so he cannot 
> construct a form or link that will
> submit a valid request.
> *References*
>  * OWASP - Cross-Site Request Forgery - 
> [https://www.owasp.org/index.php/Cross-Site_Request_Forgery]
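
Editorial note: the token scheme recommended above can be sketched in Java as
follows. This is a sketch under stated assumptions, not ManifoldCF code: the
class CsrfTokens and its methods are hypothetical, the token is assumed to be
stored in the user's server-side session, and 32 random bytes is an
illustrative choice for "enough entropy".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

/** Hypothetical helper (not ManifoldCF code): issues and checks per-session CSRF tokens. */
final class CsrfTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    private CsrfTokens() {}

    /** Returns 32 random bytes encoded as URL-safe Base64, to be stored in the session. */
    static String newToken() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Compares the session token with the submitted form parameter; rejects missing values. */
    static boolean matches(String sessionToken, String submittedToken) {
        if (sessionToken == null || submittedToken == null) {
            return false;
        }
        return MessageDigest.isEqual(
                sessionToken.getBytes(StandardCharsets.UTF_8),
                submittedToken.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String token = newToken();
        System.out.println(matches(token, token)); // the form echoed the session token
        System.out.println(matches(token, null));  // a forged form has no token to send
    }
}
```

The server would embed the token as a hidden field in every state-changing
form and refuse any request for which matches returns false.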

[jira] [Reopened] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reopened CONNECTORS-1595:
-

The complaint is that the ManifoldCF user interface has this issue.

Once again, the MCF user interface is a back-office app and is not meant to be 
exposed to untrusted open networks.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1595:

Comment: was deleted

(was: This is not applicable to MCF, since the domain scope of the pages 
fetched by it during a web crawl is explicitly laid out by configuration, and 
thus "redirection to a malicious page" is not something that can actually take 
place unless the person who sets up the crawling job does this by specific 
design.
)

> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>





[jira] [Resolved] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1595.
-
Resolution: Not A Problem

This is not applicable to MCF, since the domain scope of the pages fetched by 
it during a web crawl is explicitly laid out by configuration, and thus 
"redirection to a malicious page" is not something that can actually take place 
unless the person who sets up the crawling job does this by specific design.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>





[jira] [Commented] (CONNECTORS-1596) brute-force vulnerability

2019-03-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803005#comment-16803005
 ] 

Karl Wright commented on CONNECTORS-1596:
-

The ManifoldCF UI is not expected to be used in an open web environment, but in 
a back-office environment.  Security protections designed to prevent remote 
hackers from getting into the UI using sophisticated tools are therefore not 
expected.

Similarly, there will be no attempt to implement dual-factor authentication for 
the MCF admin UI.


> brute-force vulnerability
> -
>
> Key: CONNECTORS-1596
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1596
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> As a result of a pen test, it appears there is no functionality to counter 
> brute-force attacks for logging in.
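
Editorial note: one common countermeasure of the kind the ticket asks about is
a temporary lockout after repeated failed logins. The sketch below is
illustrative only and is not something ManifoldCF implements: the class
LoginAttemptLimiter, its thresholds, and the caller-supplied clock value are
all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not ManifoldCF code): locks a user name after too many failed logins. */
final class LoginAttemptLimiter {
    private static final class State {
        int failures;
        long lockedUntil;
    }

    private final int maxFailures;
    private final long lockMillis;
    private final Map<String, State> states = new HashMap<>();

    LoginAttemptLimiter(int maxFailures, long lockMillis) {
        this.maxFailures = maxFailures;
        this.lockMillis = lockMillis;
    }

    /** True while the user is inside a lockout window; the caller supplies the current time. */
    synchronized boolean isLocked(String user, long nowMillis) {
        State s = states.get(user);
        return s != null && nowMillis < s.lockedUntil;
    }

    /** Counts a failed login and starts a lockout once the threshold is reached. */
    synchronized void recordFailure(String user, long nowMillis) {
        State s = states.computeIfAbsent(user, k -> new State());
        s.failures++;
        if (s.failures >= maxFailures) {
            s.lockedUntil = nowMillis + lockMillis;
            s.failures = 0;
        }
    }

    /** Clears all state for the user after a successful login. */
    synchronized void recordSuccess(String user) {
        states.remove(user);
    }

    public static void main(String[] args) {
        LoginAttemptLimiter limiter = new LoginAttemptLimiter(3, 60_000L);
        for (int i = 0; i < 3; i++) {
            limiter.recordFailure("admin", 0L);
        }
        System.out.println(limiter.isLocked("admin", 1_000L));
    }
}
```

Passing the time in as a parameter keeps the sketch deterministic; a real
deployment would more likely read a clock and also bound the size of the map.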





[jira] [Commented] (CONNECTORS-1595) cross-site request forgery vulnerability

2019-03-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802998#comment-16802998
 ] 

Karl Wright commented on CONNECTORS-1595:
-

Please describe (1) what the attack looks like and (2) how this compromises MCF 
security.


> cross-site request forgery vulnerability
> 
>
> Key: CONNECTORS-1595
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1595
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> It appears that manifoldcf does not implement any CSRF protection.





[jira] [Commented] (CONNECTORS-1594) insecure cookie configuration vulnerability

2019-03-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802996#comment-16802996
 ] 

Karl Wright commented on CONNECTORS-1594:
-

The issue described will not in any way hijack what MCF indexes.  The concern 
is that the session ID can be retrieved by a man-in-the-middle should you be 
crawling a Broadvision site that has both HTTP and HTTPS pages.  I would argue 
that this is in fact a site design issue, not an MCF security vulnerability.



> insecure cookie configuration vulnerability
> ---
>
> Key: CONNECTORS-1594
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1594
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> The application session cookie "JSESSIONID" does not have the Secure and 
> HttpOnly flags set.
> The application uses an HTTP cookie as session identifier. The Set-Cookie 
> instruction sent by the application to the browser does not specifically 
> instruct the browser to only use the cookie on secure communication channels 
> (HTTPS). As the instruction is missing, browsers will fall back to their 
> default setting, generally meaning that the cookie will be used on both 
> secure and insecure communication channels.
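
Editorial note: the condition described above can be checked mechanically on
the value of a Set-Cookie header. The following Java sketch is an illustration
only; the class CookieFlagCheck is hypothetical and the cookie values in main
are made-up placeholders, not values from the report.

```java
import java.util.Locale;

/** Hypothetical check (not ManifoldCF code): inspects a Set-Cookie value for the Secure and HttpOnly attributes. */
final class CookieFlagCheck {
    private CookieFlagCheck() {}

    /** True if the attribute appears after the leading name=value pair (case-insensitive). */
    static boolean hasAttribute(String setCookieValue, String attribute) {
        String wanted = attribute.toLowerCase(Locale.ROOT);
        String[] parts = setCookieValue.split(";");
        // parts[0] is the cookie's name=value pair, so attribute matching starts at index 1.
        for (int i = 1; i < parts.length; i++) {
            String name = parts[i].trim().toLowerCase(Locale.ROOT);
            int eq = name.indexOf('=');
            if (eq >= 0) {
                name = name.substring(0, eq).trim();
            }
            if (name.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** True only when both flags named in the ticket are present. */
    static boolean isHardened(String setCookieValue) {
        return hasAttribute(setCookieValue, "Secure") && hasAttribute(setCookieValue, "HttpOnly");
    }

    public static void main(String[] args) {
        // Placeholder header values for illustration.
        System.out.println(isHardened("JSESSIONID=abc123; Path=/"));
        System.out.println(isHardened("JSESSIONID=abc123; Path=/; Secure; HttpOnly"));
    }
}
```

How the flags are actually switched on depends on the servlet container
hosting the UI and is outside the scope of this sketch.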





[jira] [Commented] (CONNECTORS-1597) reflected cross-site scripting vulnerability

2019-03-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802984#comment-16802984
 ] 

Karl Wright commented on CONNECTORS-1597:
-

Please give more details.
Bear in mind that ManifoldCF does not execute any JavaScript, so offhand I find 
this hard to believe.


> reflected cross-site scripting vulnerability
> 
>
> Key: CONNECTORS-1597
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1597
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Priority: Minor
>
> As a result of a pen test, a reflected cross-site scripting vulnerability 
> was discovered.





[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-22 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798910#comment-16798910
 ] 

Karl Wright commented on CONNECTORS-1593:
-

In ManifoldCF we rigorously adhere to a philosophy known as "bounded memory 
consumption": connectors must be written so that they are not sensitive to the 
size of the data they are indexing.  Streams are used, and the data never "hits 
memory".  But if you aren't careful, your custom connector might well put 
entire documents into memory, and then all it takes is two large documents at 
the same time and you are hosed.  Can you check your custom connector for that 
issue?

If there is a problem there, you could work around it by limiting the number of 
custom connector connections to 1.  If that works reliably, then you know where 
the issue is.
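
Editorial note: the bounded-memory idea described above can be illustrated with
a plain stream copy in Java. This is a generic sketch, not ManifoldCF connector
code: the class BoundedCopy is hypothetical and the 8 KiB buffer size is an
arbitrary assumption. The point is that memory use is fixed by the buffer, not
by the size of the document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch (not ManifoldCF code): copies a document through a fixed-size buffer. */
final class BoundedCopy {
    private BoundedCopy() {}

    /** Copies all bytes from in to out using one 8 KiB buffer; returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for a repository and an index here;
        // a real connector would stream from and to external systems.
        byte[] document = new byte[100_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(copy(new ByteArrayInputStream(document), sink));
    }
}
```

A connector that instead reads each document into a byte array or a String
before handing it on has memory use proportional to document size, which is
the failure mode the comment warns about.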


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796057#comment-16796057
 ] 

Karl Wright commented on CONNECTORS-1593:
-

So just to be sure we are in agreement: your Tika extractor connection has its 
max number of connections set to just 2, and you are still seeing "out of 
memory" with 8GB?

The max connections for the repository connections and for the output 
connections can be set to anything that makes sense for the services they are 
connecting to.  These numbers are all independent, but obviously throttling 
will take place whenever the pipeline does not have sufficient connections 
available.  The other connections and the worker thread count all contribute 
to the maximum memory consumption as well.  If you have other connectors that 
are not written with memory limits in mind, you can easily run into problems 
of this kind, and the true culprit driving memory consumption might have 
nothing to do with the stack trace you see.  May I ask what connectors are 
involved?  Are any custom?
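
As a rough illustration of how a connection limit throttles the pipeline, the 
sketch below models a connection pool as a plain java.util.concurrent 
Semaphore.  This is an assumption made for the example, not how ManifoldCF 
implements its pools, and ConnectionLimitDemo and peakConcurrentUse are 
hypothetical names.  However many worker threads there are, no more than 
maxConnections of them hold a connection at once; the rest wait.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only; a generic semaphore model, not ManifoldCF's pooling. */
final class ConnectionLimitDemo {
    /**
     * Runs `tasks` jobs on `workerThreads` threads, each needing one of
     * `maxConnections` permits, and returns the peak number of permits that
     * were held at the same time.
     */
    static int peakConcurrentUse(int workerThreads, int maxConnections, int tasks)
            throws InterruptedException {
        Semaphore connections = new Semaphore(maxConnections);
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        for (int i = 0; i < tasks; i++) {
            workers.submit(() -> {
                try {
                    connections.acquire(); // blocks while all connections are busy
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inUse.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5); // stand-in for the actual work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inUse.decrementAndGet();
                    connections.release();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }
}
```

With, say, 8 worker threads and a limit of 2, the returned peak never exceeds 
2, which is why a low limit on one connection type bounds that stage's memory 
use but says nothing about the other stages.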


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796003#comment-16796003
 ] 

Karl Wright commented on CONNECTORS-1593:
-

[~DonaldVdD], then from your description it sounds like the problem isn't with 
PDFBox.  It's because you have more worker threads allowed than you have memory 
available.  So if all of the worker threads wind up working on memory-expensive 
documents at the same time, they collide and you run out of memory.

How many worker threads did you allot?



> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796003#comment-16796003
 ] 

Karl Wright edited comment on CONNECTORS-1593 at 3/19/19 11:52 AM:
---

[~DonaldVdD], then from your description it sounds like the problem isn't with 
PDFBox.  It's because you have more worker threads allowed than you have memory 
available.  So if all of the worker threads wind up working on memory-expensive 
documents at the same time, they collide and you run out of memory.

How many worker threads did you allot?  How many Tika transformer connections 
do you have specified as the max?  The real number of simultaneous Tika 
extractions that take place will be the minimum of these two values.
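
The arithmetic can be sketched as follows.  The names are hypothetical, and 
the per-extraction peak is something you would have to measure or guess; 
ManifoldCF does not report it.

```java
/** Illustrative arithmetic only; names and figures are assumptions. */
final class ExtractionConcurrency {
    /** Simultaneous extractions are capped by whichever limit is smaller. */
    static int effectiveParallelism(int workerThreads, int maxTikaConnections) {
        return Math.min(workerThreads, maxTikaConnections);
    }

    /** Worst case: every active extraction hits its peak at the same moment. */
    static long worstCaseHeapBytes(int workerThreads, int maxTikaConnections,
                                   long peakBytesPerExtraction) {
        long parallel = effectiveParallelism(workerThreads, maxTikaConnections);
        return Math.multiplyExact(parallel, peakBytesPerExtraction);
    }
}
```

For example, 30 worker threads with a Tika connection limit of 2 give an 
effective parallelism of 2, so the worst case for extraction alone is twice 
the peak of the most expensive document.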





was (Author: kwri...@metacarta.com):
[~DonaldVdD], then from your description it sounds like the problem isn't with 
PDFBox.  It's because you have more worker threads allowed than you have memory 
available.  So if all of the worker threads wind up working on memory-expensive 
documents at the same time, they collide and you run out of memory.

How many worker threads did you allot?



> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795983#comment-16795983
 ] 

Karl Wright commented on CONNECTORS-1593:
-

The other possibility for figuring this out is to use the external Tika 
service, and then the MCF agents process won't be killed while crawling.  
Instead, errors will be logged for the specific documents that cause the issue.
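
The reason this helps is that a failure in another process comes back to the 
caller as an ordinary error for one document, instead of an OutOfMemoryError 
inside the agents JVM.  Below is a minimal sketch of that isolation pattern.  
RemoteExtractor is a hypothetical stand-in for the call to the external 
service, and none of these names are ManifoldCF or Tika APIs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; all names here are hypothetical. */
final class PerDocumentIsolation {
    /** Stand-in for a request to an extraction service in another process. */
    interface RemoteExtractor {
        String extract(String documentId) throws IOException;
    }

    /**
     * Tries every document.  A failed extraction is logged and recorded, and
     * the loop moves on.  Returns the IDs of the documents that failed.
     */
    static List<String> crawl(List<String> documentIds, RemoteExtractor extractor) {
        List<String> failed = new ArrayList<>();
        for (String id : documentIds) {
            try {
                extractor.extract(id);
            } catch (IOException e) {
                System.err.println("Extraction failed for " + id + ": " + e.getMessage());
                failed.add(id);
            }
        }
        return failed;
    }
}
```

The returned list is the useful part here: it names the documents to hand to 
whoever is investigating the parser.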


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795982#comment-16795982
 ] 

Karl Wright commented on CONNECTORS-1593:
-

[~DonaldVdD], I think you will need to identify the document and make it 
available to the PDFBox team (if possible).  That's not going to be easy, I'm 
afraid, but it might be possible with connector logging turned on.


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795947#comment-16795947
 ] 

Karl Wright commented on CONNECTORS-1593:
-

No suggestions, unfortunately.  Can you let me know what the PDFBox ticket is?


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Assigned] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1593:
---

Assignee: Karl Wright

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> Mar 16 14:20:06 manifold01 

[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update "last checked" time

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795722#comment-16795722
 ] 

Karl Wright commented on CONNECTORS-880:


[~SubasiniR], your issue has nothing whatsoever to do with this ticket.  It 
really belongs first on the user list.

The issue is that your database is going offline for 2700 seconds while your 
crawl is taking place, or almost 45 minutes.  Queries that normally would be 
instantaneous are therefore just not being completed at all for that period of 
time.  The plans look fine so that isn't it.

If this is using HSQLDB (which is the default database for the single-process 
example), then you probably have exceeded its capacity.  It stores all of its 
tables in memory.  You will want to upgrade to a real database instead.  I 
would prefer postgresql over mysql because mysql has been having transactional 
integrity issues for a couple of versions now, and that will be fatal to use 
with ManifoldCF.

By the way, "Illegal seed URL" is a warning and does not impact behavior other 
than to notify you that one of the seeds you are using in your crawl is not 
valid according to the w3c spec.  The seed will not be used.
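
To check seeds before starting a crawl, something along these lines can help. This is only a rough sketch: it uses java.net.URI as a stand-in syntax check, and the class and method names (SeedCheckSketch, isUsableSeed) are made up for illustration. The web connector's own validation may well differ in detail.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SeedCheckSketch {
  // Returns true when the seed parses as an absolute http/https URI with a host.
  // Assumption: this approximates, but is not, the connector's own check.
  static boolean isUsableSeed(String seed) {
    try {
      URI u = new URI(seed.trim());
      String scheme = u.getScheme();
      return u.isAbsolute()
          && u.getHost() != null
          && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
    } catch (URISyntaxException e) {
      // Illegal characters (for example an unescaped space) end up here.
      return false;
    }
  }

  public static void main(String[] args) {
    for (String seed : args) {
      System.out.println(seed + " -> " + (isUsableSeed(seed) ? "ok" : "illegal seed"));
    }
  }
}
```

Running it over the seed list of the job should point out which entries are likely to trigger the warning.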





> Under the right conditions, job aborts do not update "last checked" time
> 
>
> Key: CONNECTORS-880
> URL: https://issues.apache.org/jira/browse/CONNECTORS-880
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework crawler agent
>Affects Versions: ManifoldCF 1.4.1
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 1.6
>
>
> When a scheduled job is being considered to be started, MCF updates the 
> last-check field ONLY if the job didn't start.  It relies on the job's 
> completion to set the last-check field in the case where the job does start.  
> But if the job aborts, in at least one case the last-check field is NOT 
> updated.  This leads to the job being run over and over again within the 
> schedule window.





[jira] [Updated] (CONNECTORS-1591) RTF comment parsing problem

2019-03-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1591:

Fix Version/s: ManifoldCF 2.13

> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Created] (TIKA-2838) RTF document processing glues comment fields together with text without whitespace

2019-03-12 Thread Karl Wright (JIRA)
Karl Wright created TIKA-2838:
-

 Summary: RTF document processing glues comment fields together 
with text without whitespace
 Key: TIKA-2838
 URL: https://issues.apache.org/jira/browse/TIKA-2838
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.19, 1.17
Reporter: Karl Wright


See ManifoldCF ticket CONNECTORS-1591 for a sample document and a description 
of the problem.  Basically, comment fields for RTF documents are glued together 
with no whitespace between them, while other document formats properly put in a 
space (e.g. .docx etc).






[jira] [Assigned] (CONNECTORS-1591) RTF comment parsing problem

2019-03-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1591:
---

Assignee: Karl Wright

> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Assignee: Karl Wright
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem

2019-03-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790343#comment-16790343
 ] 

Karl Wright commented on CONNECTORS-1591:
-

Hi [~zfarago], I think the right approach here is to leave this ticket open and 
link to a TIKA ticket describing your problem.  The issue is not really related 
to ManifoldCF itself, and we cannot solve it for you until the Tika team 
corrects the issue.

I'll go ahead and create the linked ticket.


> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Comment Edited] (CONNECTORS-1591) RTF comment parsing problem

2019-03-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790294#comment-16790294
 ] 

Karl Wright edited comment on CONNECTORS-1591 at 3/12/19 7:18 AM:
--

[~zfarago]  Ok, we're getting closer.

What version of ManifoldCF is this? And, are you using the ES mapper attachment?




was (Author: kwri...@metacarta.com):
[~zfarago]  Ok, we're getting closer.

What version of ManifoldCF is this?


> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem

2019-03-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790294#comment-16790294
 ] 

Karl Wright commented on CONNECTORS-1591:
-

[~zfarago]  Ok, we're getting closer.

What version of ManifoldCF is this?


> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem

2019-03-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789712#comment-16789712
 ] 

Karl Wright commented on CONNECTORS-1591:
-

[~zfarago] When you run a ManifoldCF job that fetches an RTF document and runs 
it through the Tika extractor, what comes out is a stream of characters (the 
content stream) plus various metadata fields.  All of these are sent to the 
output connector, which then does whatever it wants with these.

You *cannot* see the content stream nor the metadata directly.  So I need to 
know where you are getting result.txt from.  There is a missing step that you 
aren't telling me about and it's a critical one.


> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem

2019-03-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789698#comment-16789698
 ] 

Karl Wright commented on CONNECTORS-1591:
-

I will repeat the question. *Where* is result.txt coming from?  Where are you 
finding it?  Is it content or metadata?  If metadata, what metadata field?



> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.txt
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Commented] (CONNECTORS-1591) RTF comment parsing problem

2019-03-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789564#comment-16789564
 ] 

Karl Wright commented on CONNECTORS-1591:
-

Hi [~zfarago], the result.xml file you attached is certainly not xml.  Was this 
intended?  In its current form I have no idea what this is and what it's 
supposed to represent and where you got it from exactly.  Please clarify that, 
and also clarify what you *expect* to see.  Bear in mind that if you are 
looking at the actual content or metadata output of the Tika Extractor, it's no 
help to create a ticket against ManifoldCF for that.  We do not develop Tika 
and there is nothing we could do other than open a Tika ticket.  So I suggest that 
you do that instead.


> RTF comment parsing problem
> ---
>
> Key: CONNECTORS-1591
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1591
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Zoltan Farago
>Priority: Major
> Attachments: comment.rtf, result.xml
>
>
> We have a problem with Manifold/Tika. When a comment is parsed from an RTF 
> file, the result has no separator. See attachments.





[jira] [Assigned] (CONNECTORS-1590) Resources should be closed in a finally block

2019-03-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1590:
---

Assignee: Karl Wright

> Resources should be closed in a finally block
> -
>
> Key: CONNECTORS-1590
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1590
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public class DBInterfaceHSQLDB extends Database implements IDBInterface {
>   ...
>   try
>   {
> Connection c = 
> DriverManager.getConnection(_localUrl+databaseName,userName,password);
> Statement s = c.createStatement();
> s.execute("SHUTDOWN");
> c.close();
>   }
>   catch (Exception e)
>   {
> // Never any exception!
> e.printStackTrace();
>   }
> {code}
> Connections that implement the _Closeable_ interface or its super-interface, 
> _AutoCloseable_, need to be closed after use. That close call must be made 
> in a *finally* block; otherwise an exception could keep the call from being 
> made.





[jira] [Resolved] (CONNECTORS-1590) Resources should be closed in a finally block

2019-03-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1590.
-
Resolution: Won't Fix

> Resources should be closed in a finally block
> -
>
> Key: CONNECTORS-1590
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1590
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public class DBInterfaceHSQLDB extends Database implements IDBInterface {
>   ...
>   try
>   {
> Connection c = 
> DriverManager.getConnection(_localUrl+databaseName,userName,password);
> Statement s = c.createStatement();
> s.execute("SHUTDOWN");
> c.close();
>   }
>   catch (Exception e)
>   {
> // Never any exception!
> e.printStackTrace();
>   }
> {code}
> Connections that implement the _Closeable_ interface or its super-interface, 
> _AutoCloseable_, need to be closed after use. That close call must be made 
> in a *finally* block; otherwise an exception could keep the call from being 
> made.





[jira] [Commented] (CONNECTORS-1590) Resources should be closed in a finally block

2019-03-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782728#comment-16782728
 ] 

Karl Wright commented on CONNECTORS-1590:
-

This particular call is only ever made when we're shutting down 
ManifoldCF.  You will find that there are no other such invocations where 
resources are potentially leaked.  In this case, the leak is harmless because 
we are in the process of shutting down anyway.

I don't believe this requires a "fix".
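
For anyone who wants the pattern the reporter is describing elsewhere in their own code, a try-with-resources block gives the same guarantee as an explicit finally. The sketch below is not ManifoldCF code: Tracked is a made-up stand-in for a JDBC Connection or Statement (both are AutoCloseable since Java 7), used so that the close order can be observed without a database.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseOrderSketch {
  // Hypothetical stand-in for a JDBC resource; it records what happens to it.
  static final class Tracked implements AutoCloseable {
    private final String name;
    private final List<String> log;

    Tracked(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    void execute(boolean fail) {
      if (fail) {
        throw new IllegalStateException(name + " failed");
      }
      log.add(name + " executed");
    }

    @Override
    public void close() {
      log.add(name + " closed");
    }
  }

  // Runs one "statement" and returns the event log. Both resources are closed,
  // in reverse order of declaration, whether or not execute() throws.
  static List<String> run(boolean fail) {
    List<String> log = new ArrayList<>();
    try (Tracked c = new Tracked("connection", log);
         Tracked s = new Tracked("statement", log)) {
      s.execute(fail);
    } catch (IllegalStateException e) {
      // By the time we get here the resources have already been closed.
      log.add("caught");
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run(false));
    System.out.println(run(true));
  }
}
```

With real JDBC the same shape applies: declare the Connection and the Statement in the try header and drop the explicit close() call.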


> Resources should be closed in a finally block
> -
>
> Key: CONNECTORS-1590
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1590
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public class DBInterfaceHSQLDB extends Database implements IDBInterface {
>   ...
>   try
>   {
> Connection c = 
> DriverManager.getConnection(_localUrl+databaseName,userName,password);
> Statement s = c.createStatement();
> s.execute("SHUTDOWN");
> c.close();
>   }
>   catch (Exception e)
>   {
> // Never any exception!
> e.printStackTrace();
>   }
> {code}
> Connections that implement the _Closeable_ interface or its super-interface, 
> _AutoCloseable_, need to be closed after use. That close call must be made 
> in a *finally* block; otherwise an exception could keep the call from being 
> made.





[jira] [Resolved] (CONNECTORS-1589) lrusize always null

2019-03-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1589.
-
Resolution: Fixed

r1854702


> lrusize always null
> ---
>
> Key: CONNECTORS-1589
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1589
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public abstract class BaseDescription implements ICacheDescription {
> ...
> public int getMaxLRUCount()
>...
>   String x = null; // 
> JSKW.getProperty("cache."+objectClassName+".lrusize");
>   if (x == null)
>...
> {code}
> Change this condition so that it does not always evaluate to "true"





[jira] [Commented] (CONNECTORS-1589) lrusize always null

2019-03-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782724#comment-16782724
 ] 

Karl Wright commented on CONNECTORS-1589:
-

This requires an infrastructure change.

The infrastructure change requires the use of a different constructor when 
property control over max lru count is desired for a class of cached object.

I've looked at the places where BaseDescription is extended, and in no case did 
I find a compelling case for using properties to control LRU max size.  So I've 
left those alone.
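
To illustrate what property control over the LRU size amounts to, here is a rough sketch. It is not the committed change (that went in as a constructor variant, see above); it only shows the lookup the commented-out line in the ticket was aiming at. The class name LruSizeSketch, the use of java.util.Properties, and the default value are assumptions for illustration.

```java
import java.util.Properties;

final class LruSizeSketch {
  // Returns the configured LRU size for a class of cached object, or
  // defaultSize when the property is absent, malformed, or negative.
  static int maxLruCount(Properties props, String objectClassName, int defaultSize) {
    String x = props.getProperty("cache." + objectClassName + ".lrusize");
    if (x == null) {
      return defaultSize;
    }
    try {
      int value = Integer.parseInt(x.trim());
      return value >= 0 ? value : defaultSize;
    } catch (NumberFormatException e) {
      return defaultSize;
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("cache.jobs.lrusize", "250");
    System.out.println(maxLruCount(props, "jobs", 1000));
    System.out.println(maxLruCount(props, "other", 1000));
  }
}
```

The point of the original report is simply that the lookup has to be able to return something other than null; otherwise the default branch is the only one that ever runs.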



> lrusize always null
> ---
>
> Key: CONNECTORS-1589
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1589
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public abstract class BaseDescription implements ICacheDescription {
> ...
> public int getMaxLRUCount()
>...
>   String x = null; // 
> JSKW.getProperty("cache."+objectClassName+".lrusize");
>   if (x == null)
>...
> {code}
> Change this condition so that it does not always evaluate to "true"





[jira] [Assigned] (CONNECTORS-1589) lrusize always null

2019-03-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1589:
---

Assignee: Karl Wright

> lrusize always null
> ---
>
> Key: CONNECTORS-1589
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1589
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> {code:java}
> public abstract class BaseDescription implements ICacheDescription {
> ...
> public int getMaxLRUCount()
>...
>   String x = null; // 
> JSKW.getProperty("cache."+objectClassName+".lrusize");
>   if (x == null)
>...
> {code}
> Change this condition so that it does not always evaluate to "true"





[jira] [Assigned] (CONNECTORS-1588) Custom Jcifs Properties

2019-02-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1588:
---

Assignee: Karl Wright

> Custom Jcifs Properties
> ---
>
> Key: CONNECTORS-1588
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1588
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: CONNECTORS-1588
>
>
> In some cases, "jcifs" runs slowly. To solve this problem, we need to set 
> some custom properties. 
>   
>  For example, my problem was in my test environment: I have a Windows server 
> and an Ubuntu server in the same network in the AWS EC2 service. The Windows 
> server has the Active Directory service, a DNS server and a shared folder, 
> while the Ubuntu server has some instances such as manifoldcf, a db instance 
> and solr. 
>   
>  If the DNS settings are not defined on the Ubuntu server, jcifs runs slowly, 
> because the default resolver order is set as 'LMHOSTS,DNS,WINS'. This 
> means[1] that "jcifs" first checks the '/etc/hosts' file on a linux/unix 
> server, then it checks the DNS server. In my opinion, the linux server 
> doesn't recognize the DNS server, and threads wait on every file they try to 
> read.
>   
>  I suppose WINS is used when accessing hosts on different subnets. So, I 
> have set "jcifs.resolveOrder = WINS" and my problem has been FIXED. 
>   
>  Another suggestion for a similar problem, from [another 
> example|https://stackoverflow.com/a/18837754]: "-Djcifs.resolveOrder = DNS"
>   
>  We need to be able to set a custom resolveOrder variable.
> ^[1]^ [https://www.jcifs.org/src/docs/resolver.html] 





[jira] [Commented] (CONNECTORS-1588) Custom Jcifs Properties

2019-02-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780473#comment-16780473
 ] 

Karl Wright commented on CONNECTORS-1588:
-

Patch looks fine.  I'll commit it.
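
For readers without the patch at hand, the general idea can be sketched as follows. This is not the attached patch: it assumes that JCIFS 1.x picks up jcifs.* settings from the JVM system properties when its configuration is first loaded, so the value has to be in place before any JCIFS class is touched. The class name JcifsResolveOrderSketch is made up.

```java
final class JcifsResolveOrderSketch {
  static final String KEY = "jcifs.resolveOrder";

  // Applies a fallback resolver order unless one was already supplied,
  // e.g. on the command line with -Djcifs.resolveOrder=DNS.
  // Returns the value that is in effect afterwards.
  static String applyResolveOrder(String fallback) {
    String existing = System.getProperty(KEY);
    if (existing == null || existing.trim().isEmpty()) {
      System.setProperty(KEY, fallback);
      return fallback;
    }
    return existing;
  }

  public static void main(String[] args) {
    // Call this early, before the first JCIFS class is loaded.
    System.out.println(KEY + " = " + applyResolveOrder("DNS"));
  }
}
```

Note that on a real command line the -D flag takes no spaces around the equals sign, i.e. -Djcifs.resolveOrder=DNS.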


> Custom Jcifs Properties
> ---
>
> Key: CONNECTORS-1588
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1588
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Cihad Guzel
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: CONNECTORS-1588
>
>
> In some cases, "jcifs" runs slowly. To solve this problem, we need to set 
> some custom properties. 
>   
>  For example, my problem was in my test environment: I have a Windows server 
> and an Ubuntu server in the same network in the AWS EC2 service. The Windows 
> server has the Active Directory service, a DNS server and a shared folder, 
> while the Ubuntu server has some instances such as manifoldcf, a db instance 
> and solr. 
>   
>  If the DNS settings are not defined on the Ubuntu server, jcifs runs slowly, 
> because the default resolver order is set as 'LMHOSTS,DNS,WINS'. This 
> means[1] that "jcifs" first checks the '/etc/hosts' file on a linux/unix 
> server, then it checks the DNS server. In my opinion, the linux server 
> doesn't recognize the DNS server, and threads wait on every file they try to 
> read.
>   
>  I suppose WINS is used when accessing hosts on different subnets. So, I 
> have set "jcifs.resolveOrder = WINS" and my problem has been FIXED. 
>   
>  Another suggestion for a similar problem, from [another 
> example|https://stackoverflow.com/a/18837754]: "-Djcifs.resolveOrder = DNS"
>   
>  We need to be able to set a custom resolveOrder variable.
> ^[1]^ [https://www.jcifs.org/src/docs/resolver.html] 





[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-02-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780373#comment-16780373
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Hi [~Subasini],

The "excluded mime types" that you set are meant to exclude documents 
*entirely*, so changing that setting has no effect on *how* documents are 
indexed.  You can look at the Simple History report to verify that this is 
taking place as you desire, because most connectors create a record when they 
reject a document for any reason.  The Web Connector is no exception.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, Manifold and Solr 
> settings_CustomField.docx, managed-schema, manifold settings.docx, 
> manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> When I check the "Use the Extract Update Handler:" param, I get an error on 
> Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the tika exception, my documents get indexed but don't have a 
> content field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence have not configured the external tika 
> extractor in the manifoldCF pipeline
> Please help me with this problem
> Thanks in advance





[jira] [Resolved] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved LUCENE-8696.
-
   Resolution: Fixed
Fix Version/s: 7.7.2
   master (9.0)
   8.x

> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Fix For: 8.x, master (9.0), 7.7.2
>
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data

2019-02-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778989#comment-16778989
 ] 

Karl Wright commented on CONNECTORS-1587:
-

It is simple: the crawler is requesting more metadata columns at one time than 
your SharePoint instance is allowed to respond to.  This is apparently a 
SharePoint configuration issue, although it is one I've never heard of before.  
It's certainly *not* a SharePoint Connector issue, unless there's some 
hard-wired Microsoft limit that you are up against.


> Unable to Crawl Documents Meta data
> ---
>
> Key: CONNECTORS-1587
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1587
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>
> I tried to crawl the meta data of Document section. but cannot able to crawl 
> the data.
> I have facing error stating that " The query cannot be completed because the 
> number of lookup columns it contains exceeds the lookup column threshold 
> enforced by the administrator."
> How can I resolve this issue.Is there any config needs for that.  
> Please assist the same.
> While checking for documentation mentioned the meta data contents as 
> drop-down type but my connector(Manifold 2.9.1)  is different. Is there any 
> version update is there for this lookup.
> Thanks.





[jira] [Resolved] (CONNECTORS-1587) Unable to Crawl Documents Meta data

2019-02-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1587.
-
Resolution: Invalid

Not a ManifoldCF bug

> Unable to Crawl Documents Meta data
> ---
>
> Key: CONNECTORS-1587
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1587
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>
> I tried to crawl the meta data of Document section. but cannot able to crawl 
> the data.
> I have facing error stating that " The query cannot be completed because the 
> number of lookup columns it contains exceeds the lookup column threshold 
> enforced by the administrator."
> How can I resolve this issue.Is there any config needs for that.  
> Please assist the same.
> While checking for documentation mentioned the meta data contents as 
> drop-down type but my connector(Manifold 2.9.1)  is different. Is there any 
> version update is there for this lookup.
> Thanks.





[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-02-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778159#comment-16778159
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~erlendfg], if ModifiedHttpSolrClient overrides this setting already, then I 
don't understand why it isn't working, unless the override isn't setting it.  
Is that the case?  If so, then the obvious fix is to just set it there.



> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more effective instead of the following:
>  * Send a HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.
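
For illustration, the header that preemptive authentication would send on the 
first request can be built with the JDK alone.  This is only a sketch: the 
class name, method name and the "user"/"pass" credentials are made up here, 
and it does not show how the Solr connector or SolrJ actually attach the 
header.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of ManifoldCF or SolrJ.
final class PreemptiveBasicAuth {

  // Builds the value of the "Authorization" header for HTTP Basic authentication:
  // the literal "Basic " followed by base64("user:password").
  static String basicAuthHeaderValue(String user, String password) {
    String credentials = user + ":" + password;
    byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
    return "Basic " + Base64.getEncoder().encodeToString(raw);
  }

  public static void main(String[] args) {
    // With preemptive authentication this header goes out with the first POST,
    // so the 401 round trip described above is avoided.
    System.out.println("Authorization: " + basicAuthHeaderValue("user", "pass"));
  }
}
```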





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778005#comment-16778005
 ] 

Karl Wright commented on LUCENE-8696:
-

I have confirmed that the above is indeed the issue.  I did this by checking 
whether intersection with the segment end planes was taking place, and it was.

There are two ways forward.  The first is to make this hack officially part of 
the code base.  That will probably be fine for real-world paths, because 
real-world paths are much narrower than what occurs in random testing.  The 
second is to change how we represent segment endpoints, so that there is no gap 
between one of the points and the adjoining path segment.  The way to do that 
is to use TWO planes rather than one, but only when there are two adjoining 
segments and a gap is thus present.  Membership would be tricky because, 
depending on the specific conformation of the segment endpoint, EITHER plane or 
BOTH planes would need to match the point being tested.  But we could determine 
this by simply looking at the fourth point in the context of a plane 
constructed from the other three.

Such a change would finally make GeoPaths first-class citizens in the oblate 
world, at the cost of needing a second plane for each segment endpoint.  But 
there's no reason we can't use class inheritance to solve that problem too.  A 
base SegmentEndpoint class or interface would have multiple implementations, 
and the right one could be picked at path construction time to match the 
conformation.  For SPHERE planets, the simplest implementation would still be 
the one that got used.
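
A rough sketch of what such a hierarchy might look like follows.  Every name 
in it (HalfSpace, the three endpoint classes, Conformation, create) is 
hypothetical and chosen only for this illustration; it is not the spatial3d 
code, and the half-space test stands in for the real plane membership logic.

```java
// Hypothetical sketch: none of these types are taken from the spatial3d module.
final class SegmentEndpointSketch {

  /** The side of the plane a*x + b*y + c*z + d = 0 where the expression is >= 0. */
  static final class HalfSpace {
    private final double a, b, c, d;

    HalfSpace(double a, double b, double c, double d) {
      this.a = a; this.b = b; this.c = c; this.d = d;
    }

    boolean contains(double x, double y, double z) {
      return a * x + b * y + c * z + d >= 0.0;
    }
  }

  /** Membership test for the region around one path point. */
  interface SegmentEndpoint {
    boolean isWithin(double x, double y, double z);
  }

  /** No gap (for example on a sphere): one plane is enough. */
  static final class SinglePlaneEndpoint implements SegmentEndpoint {
    private final HalfSpace plane;
    SinglePlaneEndpoint(HalfSpace plane) { this.plane = plane; }
    @Override public boolean isWithin(double x, double y, double z) {
      return plane.contains(x, y, z);
    }
  }

  /** Gap present, conformation where matching EITHER plane is sufficient. */
  static final class EitherPlaneEndpoint implements SegmentEndpoint {
    private final HalfSpace first, second;
    EitherPlaneEndpoint(HalfSpace first, HalfSpace second) { this.first = first; this.second = second; }
    @Override public boolean isWithin(double x, double y, double z) {
      return first.contains(x, y, z) || second.contains(x, y, z);
    }
  }

  /** Gap present, conformation where BOTH planes must match. */
  static final class BothPlanesEndpoint implements SegmentEndpoint {
    private final HalfSpace first, second;
    BothPlanesEndpoint(HalfSpace first, HalfSpace second) { this.first = first; this.second = second; }
    @Override public boolean isWithin(double x, double y, double z) {
      return first.contains(x, y, z) && second.contains(x, y, z);
    }
  }

  enum Conformation { NO_GAP, EITHER, BOTH }

  /** Picks the implementation once, at path construction time. */
  static SegmentEndpoint create(Conformation conformation, HalfSpace first, HalfSpace second) {
    switch (conformation) {
      case EITHER: return new EitherPlaneEndpoint(first, second);
      case BOTH:   return new BothPlanesEndpoint(first, second);
      default:     return new SinglePlaneEndpoint(first);
    }
  }
}
```

Choosing the implementation up front keeps the per-point membership check free 
of any conformation branching.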


> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777688#comment-16777688
 ] 

Karl Wright commented on LUCENE-8696:
-

Since we've eliminated the computation of the solid's example intersection 
points, that basically leaves numerical factors as the only potential cause.  
Let's examine this further.

In the case of GeoPaths on the WGS84 globe, path intersection points are 
described by "circles", which are in fact just planes that are picked so as to 
connect the path segments together, as described above.  But, each plane with 
two adjoining segments is selected based on FOUR surface points, not three.  
That means that there is a gap between one of the points and the actual 
endpoint circle.
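
To make the "four points, not three" remark concrete: three points determine a 
plane, and a fourth point generally sits some distance off it.  The sketch 
below only measures that distance for arbitrary example points; the class and 
method names are hypothetical, and it does not reproduce how the endpoint 
planes are actually chosen.

```java
// Hypothetical illustration, not spatial3d code.
final class FourPointGap {

  // Returns {a, b, c, d} for the plane a*x + b*y + c*z + d = 0 through p, q and r,
  // with (a, b, c) normalized to unit length.
  static double[] planeThrough(double[] p, double[] q, double[] r) {
    double ux = q[0] - p[0], uy = q[1] - p[1], uz = q[2] - p[2];
    double vx = r[0] - p[0], vy = r[1] - p[1], vz = r[2] - p[2];
    // The normal is the cross product of the two in-plane vectors.
    double nx = uy * vz - uz * vy;
    double ny = uz * vx - ux * vz;
    double nz = ux * vy - uy * vx;
    double len = Math.sqrt(nx * nx + ny * ny + nz * nz);
    if (len == 0.0) {
      throw new IllegalArgumentException("points are collinear; no unique plane");
    }
    nx /= len; ny /= len; nz /= len;
    double d = -(nx * p[0] + ny * p[1] + nz * p[2]);
    return new double[] {nx, ny, nz, d};
  }

  // Signed distance from pt to the plane; zero means pt lies on the plane.
  static double signedDistance(double[] plane, double[] pt) {
    return plane[0] * pt[0] + plane[1] * pt[1] + plane[2] * pt[2] + plane[3];
  }

  public static void main(String[] args) {
    double[] plane = planeThrough(
        new double[] {1, 0, 0}, new double[] {0, 1, 0}, new double[] {0, 0, 1});
    // A fourth point that is not on the plane: its distance is the "gap".
    System.out.println(signedDistance(plane, new double[] {1, 1, 1}));
  }
}
```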

When we compute membership in the path, we exclude points in that gap from 
membership.  This is done by treating the path segment end planes as 
delimiters of membership for both the endpoint "circles" and the segments.  
But those segment end planes are not considered when determining intersection, 
because they are "interior" to the path.  This means that getRelationship() 
can miss an intersection with the path edge if the "gap" is large enough and 
everything lines up perfectly, and thus "CONTAINS" is reported where 
"OVERLAPS" would be the correct answer.

It should be possible to see whether our test case would be resolved by 
considering path segment end edges.  A simple trial code change should be 
sufficient to find out.  Then the question becomes how to prevent spurious 
intersections.  We could just permit them (it's allowed in the contract), or we 
could make more significant changes to path representation, for better 
accuracy.  Stay tuned.


> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Comment Edited] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899
 ] 

Karl Wright edited comment on LUCENE-8696 at 2/26/19 7:24 AM:
--

Reviewing the solid, and what the edge points *should* be:

minx, maxx:  -0.7731590077686981, 1.0011188539924791
miny, maxy:  0.9519964046486451, 1.0011188539924791
minz, maxz:  -0.9977622932859775, 0.9977599768255027

The minz/maxz planes might touch the world at the poles, but probably don't.
The maxx plane might touch the world at the max X pole.
The minx plane definitely slices the world, so it should generate at least one 
point.
The maxy plane might touch the world at the max Y pole.
The miny plane slices the world, so it should generate at least one point.

This is the debugging output:

{code}
  [junit4]   2>  notableMinXPoints=[] notableMaxXPoints=[] notableMinYPoints=[] 
notableMaxYPoints=[] notableMinZPoints=[] notableMaxZPoints=[]
   [junit4]   2>  minXEdges=[] maxXEdges=[] minYEdges=[[X=0.0, 
Y=0.9519964046486451, Z=-0.30870622678085735]] maxYEdges=[[X=-0.0, 
Y=1.0011188539924791, Z=0.0]] minZEdges=[] maxZEdges=[]
{code}

"Notable points" are places where the plane intersections also intersect the 
world.  There are none of these, as expected.

The planes that intersect the world are minY and maxY.  We do *not* see 
intersections for minX, though, and we expected to.  That's got to be 
researched to figure out why.  It may be because the intersection is actually 
outside the solid bounds as determined by the Y plane.

So the question becomes whether the line (-0.7731590077686981, 
0.9519964046486451, t) can ever pass through the world.  We can determine that 
by picking t = 0, where the line is closest to the origin, and computing the 
distance to the origin:

sqrt(x^2 + y^2 + 0) = 1.2264061340998847885343642874005, which is indeed off 
the surface.  So the points look reasonable.
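
The same arithmetic as a small check, using the minx and miny values listed 
above.  The class and method names are made up for this sketch, and comparing 
against the maxx bound (1.0011188539924791) as a stand-in for the world's 
largest extent is an assumption of the sketch, not something the test output 
states.

```java
// Hypothetical check, not part of the test suite.
final class OriginDistance {

  static double distanceToOrigin(double x, double y, double z) {
    return Math.sqrt(x * x + y * y + z * z);
  }

  public static void main(String[] args) {
    double minX = -0.7731590077686981;
    double minY = 0.9519964046486451;
    // t = 0 is the point on the line (minX, minY, t) nearest the origin.
    double nearest = distanceToOrigin(minX, minY, 0.0);
    // Assumed upper bound on the world's extent: the maxx value from the solid bounds.
    double assumedMaxExtent = 1.0011188539924791;
    System.out.println(nearest);
    System.out.println(nearest > assumedMaxExtent);
  }
}
```

If the nearest point on the line is already farther out than that bound, no 
other value of t can bring the line back onto the surface.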




was (Author: kwri...@metacarta.com):
Reviewing the solid, and what the edge points *should* be:

minx, maxx:  -0.7731590077686981, 1.0011188539924791
miny, maxy:  0.9519964046486451, 1.0011188539924791
minz, maxz:  -0.9977622932859775, 0.9977599768255027

The minz/maxz planes might touch the world at the poles, but probably don't.
The maxx plane might touch the world at the max X pole.
The minx plane definitely slices the world, so it should generate at least one 
point.
The maxy plane might touch the world at the max Y pole.
The miny plane slices the world, so it should generate at least one point.

This is the debugging output:

{code}
  [junit4]   2>  notableMinXPoints=[] notableMaxXPoints=[] notableMinYPoints=[] 
notableMaxYPoints=[] notableMinZPoints=[] notableMaxZPoints=[]
   [junit4]   2>  minXEdges=[] maxXEdges=[] minYEdges=[[X=0.0, 
Y=0.9519964046486451, Z=-0.30870622678085735]] maxYEdges=[[X=-0.0, 
Y=1.0011188539924791, Z=0.0]] minZEdges=[] maxZEdges=[]
{code}

"Notable points" are places where the plane intersections also intersect the 
world.  There are none of these, as expected.

The planes that intersect the world are minY and maxY.  We do *not* see 
intersections for minX, though, and we expected to.  That's got to be 
researched to figure out why.  It may be because the intersection is actually 
outside the solid bounds as determined by the Y plane.

Out of time for the moment though.



> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}



--

[jira] [Commented] (SOLR-13270) SolrJ does not send "Expect: 100-continue" header

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777186#comment-16777186
 ] 

Karl Wright commented on SOLR-13270:


I just grepped for it and did not find it explicitly set:

{code}
kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ grep -R 
"setExpectContinue" . --include "*.java"
kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$
{code}

I therefore believe it's being set because the RequestConfig is being 
overwritten.  And, sure enough:

{code}
kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$ grep -R 
"setDefaultRequestConfig" . --include "*.java"
./lucene/replicator/src/java/org/apache/lucene/replicator/http/HttpClientBase.java:
httpc = 
HttpClientBuilder.create().setConnectionManager(conMgr).setDefaultRequestConfig(this.defaultConfig).build();
./solr/solrj/src/java/org/apache/solr/client/solrj/impl/HttpClientUtil.java:
HttpClientBuilder retBuilder = builder.setDefaultRequestConfig(requestConfig);
kawright@1USDKAWRIGHT:/mnt/c/wipgit/lucene4/lucene-solr$
{code}


> SolrJ does not send "Expect: 100-continue" header
> -
>
> Key: SOLR-13270
> URL: https://issues.apache.org/jira/browse/SOLR-13270
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Affects Versions: 7.7
>Reporter: Erlend Garåsen
>Priority: Major
>
> SolrJ does not set the "Expect: 100-continue" header, even though it's 
> configured in HttpClient:
> {code:java}
> builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build());{code}
> A HttpClient developer has reviewed the code and says we're setting up
>  the client correctly, so we have a reason to believe there is a bug in
>  SolrJ. It's actually a problem we are facing in ManifoldCF, explained in:
>  https://issues.apache.org/jira/browse/CONNECTORS-1564
> The problem can be reproduced by building and running the following small 
> Maven project:
> [http://folk.uio.no/erlendfg/solr/missing-header.zip]
> The application runs SolrJ code where the header does not show up and 
> HttpClient code where the header is present.
>  
> {code:java}
> HttpClientBuilder builder = HttpClients.custom();
> // This should add an Expect: 100-continue header:
> builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build());
> HttpClient httpClient = builder.build();
> // Start Solr and create a core named "test".
> String baseUrl = "http://localhost:8983/solr/test";
> // Test using SolrJ — no expect 100 header
> HttpSolrClient client = new HttpSolrClient.Builder()
>   .withHttpClient(httpClient)
>   .withBaseSolrUrl(baseUrl).build();
> SolrQuery query = new SolrQuery();
> query.setQuery("*:*");
> client.query(query);
> // Test using HttpClient directly — expect 100 header shows up:
> HttpPost httpPost = new HttpPost(baseUrl);
> HttpEntity entity = new InputStreamEntity(new 
> ByteArrayInputStream("test".getBytes()));
> httpPost.setEntity(entity);
> httpClient.execute(httpPost);
> {code}
> When using the last HttpClient test, the expect 100 header appears in 
> missing-header.log:
> {noformat}
> http-outgoing-1 >> Expect: 100-continue{noformat}






[jira] [Commented] (SOLR-13270) SolrJ does not send "Expect: 100-continue" header

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777113#comment-16777113
 ] 

Karl Wright commented on SOLR-13270:


Hi [~erlendfg], can you identify where in the SolrJ code it explicitly sets 
expect/continue to "off"?  It must be there somewhere.


> SolrJ does not send "Expect: 100-continue" header
> -
>
> Key: SOLR-13270
> URL: https://issues.apache.org/jira/browse/SOLR-13270
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Affects Versions: 7.7
>Reporter: Erlend Garåsen
>Priority: Major
>
> SolrJ does not set the "Expect: 100-continue" header, even though it's 
> configured in HttpClient:
> {code:java}
> builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build());{code}
> A HttpClient developer has reviewed the code and says we're setting up
>  the client correctly, so we have a reason to believe there is a bug in
>  SolrJ. It's actually a problem we are facing in ManifoldCF, explained in:
>  https://issues.apache.org/jira/browse/CONNECTORS-1564
> The problem can be reproduced by building and running the following small 
> Maven project:
> [http://folk.uio.no/erlendfg/solr/missing-header.zip]
> The application runs SolrJ code where the header does not show up and 
> HttpClient code where the header is present.
>  
> {code:java}
> HttpClientBuilder builder = HttpClients.custom();
> // This should add an Expect: 100-continue header:
> builder.setDefaultRequestConfig(RequestConfig.custom().setExpectContinueEnabled(true).build());
> HttpClient httpClient = builder.build();
> // Start Solr and create a core named "test".
> String baseUrl = "http://localhost:8983/solr/test";
> // Test using SolrJ — no expect 100 header
> HttpSolrClient client = new HttpSolrClient.Builder()
>   .withHttpClient(httpClient)
>   .withBaseSolrUrl(baseUrl).build();
> SolrQuery query = new SolrQuery();
> query.setQuery("*:*");
> client.query(query);
> // Test using HttpClient directly — expect 100 header shows up:
> HttpPost httpPost = new HttpPost(baseUrl);
> HttpEntity entity = new InputStreamEntity(new 
> ByteArrayInputStream("test".getBytes()));
> httpPost.setEntity(entity);
> httpClient.execute(httpPost);
> {code}
> When using the last HttpClient test, the expect 100 header appears in 
> missing-header.log:
> {noformat}
> http-outgoing-1 >> Expect: 100-continue{noformat}






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776975#comment-16776975
 ] 

Karl Wright commented on LUCENE-8696:
-

[~jpountz], should be addressed now.


> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Comment Edited] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899
 ] 

Karl Wright edited comment on LUCENE-8696 at 2/25/19 2:39 PM:
--

Reviewing the solid, and what the edge points *should* be:

minx, maxx:  -0.7731590077686981, 1.0011188539924791
miny, maxy:  0.9519964046486451, 1.0011188539924791
minz, maxz:  -0.9977622932859775, 0.9977599768255027

The minz/maxz planes might touch the world at the poles, but probably don't.
The maxx plane might touch the world at the max X pole.
The minx plane definitely slices the world, so it should generate at least one 
point.
The maxy plane might touch the world at the max Y pole.
The miny plane slices the world, so it should generate at least one point.

This is the debugging output:

{code}
  [junit4]   2>  notableMinXPoints=[] notableMaxXPoints=[] notableMinYPoints=[] 
notableMaxYPoints=[] notableMinZPoints=[] notableMaxZPoints=[]
   [junit4]   2>  minXEdges=[] maxXEdges=[] minYEdges=[[X=0.0, 
Y=0.9519964046486451, Z=-0.30870622678085735]] maxYEdges=[[X=-0.0, 
Y=1.0011188539924791, Z=0.0]] minZEdges=[] maxZEdges=[]
{code}

"Notable points" are places where the plane intersections also intersect the 
world.  There are none of these, as expected.

The planes that intersect the world are minY and maxY.  We do *not* see 
intersections for minX, though, and we expected to.  That's got to be 
researched to figure out why.  It may be because the intersection is actually 
outside the solid bounds as determined by the Y plane.

Out of time for the moment though.




was (Author: kwri...@metacarta.com):
Reviewing the solid, and what the edge points *should* be:

minx, maxx:  -0.7731590077686981, 1.0011188539924791
miny, maxy:  0.9519964046486451, 1.0011188539924791
minz, maxz:  -0.9977622932859775, 0.9977599768255027

The minz/maxz planes might touch the world at the poles, but probably don't.
The maxx plane might touch the world at the max X pole.
The minx plane definitely slices the world, so it should generate at least one 
point.
The maxy plane might touch the world at the max Y pole.
The miny plane slices the world, so it should generate at least one point.

We therefore should expect a minimum of two points, which is what we see.  If 
any of these planes actually encounters the pole, though, we should have gotten 
another point from that.  The maxZ plane looks potentially like it might 
qualify.  Out of time for the moment though.



> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776899#comment-16776899
 ] 

Karl Wright commented on LUCENE-8696:
-

Reviewing the solid, and what the edge points *should* be:

minx, maxx:  -0.7731590077686981, 1.0011188539924791
miny, maxy:  0.9519964046486451, 1.0011188539924791
minz, maxz:  -0.9977622932859775, 0.9977599768255027

The minz/maxz planes might touch the world at the poles, but probably don't.
The maxx plane might touch the world at the max X pole.
The minx plane definitely slices the world, so it should generate at least one 
point.
The maxy plane might touch the world at the max Y pole.
The miny plane slices the world, so it should generate at least one point.

We therefore should expect a minimum of two points, which is what we see.  If 
any of these planes actually encounters the pole, though, we should have gotten 
another point from that.  The maxZ plane looks potentially like it might 
qualify.  Out of time for the moment though.



> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776881#comment-16776881
 ] 

Karl Wright commented on LUCENE-8696:
-

Reviewing the solid edge point logic finds nothing wrong.  Will try to rule out 
numerical precision problems next.



> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8696.patch
>
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776841#comment-16776841
 ] 

Karl Wright commented on LUCENE-8696:
-

I've verified that there are two solid edge points and they both lie within the 
path:

{code}
   [junit4]   2>  solid edge point [X=0.0, Y=0.9519964046486451, 
Z=-0.30870622678085735] path.isWithin()? true
   [junit4]   2>  solid edge point [X=-0.0, Y=1.0011188539924791, Z=0.0] 
path.isWithin()? true
   [junit4]   2>  path edge point [X=0.22516844226485835, 
Y=0.003930329545205224, Z=0.9721897091178435] isWithin()? false 
minx=0.9983274500335564 maxx=-0.7759504117276208 miny=-0.9480660751034399 
maxy=-0.9971885244472739 minz=1.969952002403821 maxz=-0.025570267707659133
{code}

So this confirms that no intersection is detected, and it shows how the code 
arrives at the conclusion that the solid is completely within the path.

Possible errors that would cause this:

(1) We might be missing a solid edge point.  These edge points are computed 
based on the lines of intersection between adjoining solid planes and the 
surface of the world.  There is also special computation to handle the case 
where a solid edge plane intersects the world by itself, but this logic might 
not be complete.  We need to capture all plane/world intersection closed curves 
and come up with an example point for each.

(2) There might be numerical precision issues with intersection computation 
that prevent us from concluding that the path edges intersect the solid edges.

I still have to figure out which is the real problem here.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776821#comment-16776821
 ] 

Karl Wright commented on LUCENE-8696:
-

Looking at the actual failure now.

Basically, the problem is that the relationship between the XYZSolid and the 
GeoPath is containment: the XYZSolid is reported to be inside the GeoPath.  It 
reaches this conclusion because it detects no intersections between the solid 
and the path edges, and because the path edge point it is using is outside the 
solid:

{code}
   [junit4]   1> in isShapeInsideArea
   [junit4]   1>  there are 1 pathPoints
   [junit4]   1>  pathpoint [X=0.22516844226485835, Y=0.003930329545205224, 
Z=0.9721897091178435]...
   [junit4]   1>   outside
{code}

Haven't verified it yet, but this implies that at least one of the solid's 
surface points is inside of the path too.  Still too early to know which 
conclusion is incorrect.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776637#comment-16776637
 ] 

Karl Wright commented on LUCENE-8696:
-

I revised the simple test case to match the actual failure, and committed it 
with @AwaitsFix.  I'm now committing to master, branch_7x, and branch_8x.  No 
further fixes for branch_6x.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776617#comment-16776617
 ] 

Karl Wright commented on LUCENE-8696:
-

[~ivera], I'm looking at your test case for reproducing the original failure 
and I honestly can't find any place in testGeo3DRelations where we expect two 
paths with different widths to exactly fit inside one another.  The only 
relationships that are computed in this test are between an xyz solid and a 
path.  Can you describe how you came up with the simplified test case?





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776010#comment-16776010
 ] 

Karl Wright commented on LUCENE-8696:
-

More debugging shows that the second circle plane is wildly different in the 
two runs:

{code}
   [junit4]   1> Checking 'iswithin' for 0.020717830200521595 
0.9523290534985549 0.30699177254488114
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle [A=0.9998476951745469, 
B=0.01745240539714465, C=-0.0, D=-0.5409068252602056, side=1.0]
   [junit4]   1>  pathPoint...
   [junit4]   1>   passes circle plane [A=0.7071067811865476, 
B=-0.7071067811865476, C=0.0, D=0.05929892163149414, side=-1.0]
   [junit4]   1>   within!
   [junit4]   1> Checking 'iswithin' for 0.020717830200521595 
0.9523290534985549 0.30699177254488114
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle [A=0.9998476951745469, 
B=0.017452405397144648, C=-0.0, D=-0.22520274172912894, side=1.0]
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle [A=0.7863183388224225, 
B=-0.6178215519319035, C=0.0, D=-0.0021572780909792644, side=1.0]
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of cutoff plane [A=0.6045468388328157, 
B=-0.796569594986684, C=-3.0241383426688587E-48, D=0.0, side=1.0]
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of cutoff plane [A=-0.6885949363624547, 
B=-0.29030954074708304, C=-0.6644978436136604, D=0.0, side=1.0]
   [junit4]   1>  segment...
   [junit4]   1>  segment...
   [junit4]   1>  segment...
{code}

For the successful run it's: [A=0.7071067811865476, B=-0.7071067811865476, 
C=0.0, D=0.05929892163149414, side=-1.0]
For the failed run it's: [A=0.7863183388224225, B=-0.6178215519319035, C=0.0, 
D=-0.0021572780909792644, side=1.0]
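
The two outcomes in the log can be reproduced with a minimal sketch, assuming that a sided plane accepts a point when the sign of A*x + B*y + C*z + D equals the printed side value.  The class name SidedPlaneSketch is hypothetical and is not the spatial3d implementation; a real check would presumably also treat values very close to zero as "on the plane", which this sketch leaves out.

```java
/** Hypothetical stand-in for a sided plane: [A, B, C, D, side] as printed in the log. */
final class SidedPlaneSketch {
    final double a, b, c, d, side;

    SidedPlaneSketch(double a, double b, double c, double d, double side) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
        this.side = side;
    }

    /** Plane equation value at the point; zero means the point is on the plane. */
    double evaluate(double x, double y, double z) {
        return a * x + b * y + c * z + d;
    }

    /** Assumption: membership means the sign of the value matches the stored side. */
    boolean isWithin(double x, double y, double z) {
        return Math.signum(evaluate(x, y, z)) == side;
    }

    public static void main(String[] args) {
        // The point being checked, copied from the diagnostic output.
        double x = 0.020717830200521595, y = 0.9523290534985549, z = 0.30699177254488114;

        // Second circle plane of the successful run.
        SidedPlaneSketch successRun = new SidedPlaneSketch(
            0.7071067811865476, -0.7071067811865476, 0.0, 0.05929892163149414, -1.0);
        // Second circle plane of the failed run.
        SidedPlaneSketch failedRun = new SidedPlaneSketch(
            0.7863183388224225, -0.6178215519319035, 0.0, -0.0021572780909792644, 1.0);

        System.out.println("successful run within? " + successRun.isWithin(x, y, z));
        System.out.println("failed run within?     " + failedRun.isWithin(x, y, z));
    }
}
```

Both planes evaluate to a clearly negative value at this point, so under this reading the verdict flips only because the stored side differs between the two runs.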

The naive expectation would be that the vector is identical (A,B,C), but the 
displacement differs (D).  But because this is WGS84, that expectation is 
incorrect, because oblateness can affect the vector.  Because of oblateness, 
the circle is constructed from three of the four points where the segment edges 
intersect.  Which three it picks is random, but the hope is that the selection 
is not important.

What this shows is that very wide paths on oblate spheroids are mathematically 
unrelatable to each other.  This is not exactly surprising in retrospect; paths 
were originally designed for a SPHERE world and retrofitting them to WGS84 
involved compromises.

I therefore think the best approach might be to modify the test suite to limit 
the width of paths tested on WGS84.  [~ivera], what do you think?






[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776006#comment-16776006
 ] 

Karl Wright commented on LUCENE-8696:
-

Added some simple diagnostics.  The difference lies in the construction of the 
second circle plane:

{code}
   [junit4]   1> Checking 'iswithin' for 0.020717830200521595 
0.9523290534985549 0.30699177254488114
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle
   [junit4]   1>  pathPoint...
   [junit4]   1>   within!
   [junit4]   1> Checking 'iswithin' for 0.020717830200521595 
0.9523290534985549 0.30699177254488114
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of circle
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of cutoff plane [A=0.6045468388328157, 
B=-0.796569594986684, C=-3.0241383426688587E-48, D=0.0, side=1.0]
   [junit4]   1>  pathPoint...
   [junit4]   1>   outside of cutoff plane [A=-0.6885949363624547, 
B=-0.29030954074708304, C=-0.6644978436136604, D=0.0, side=1.0]
   [junit4]   1>  segment...
   [junit4]   1>  segment...
   [junit4]   1>  segment...
{code}

So the second circle plane accepts the point in the narrower case, but rejects 
it in the wider case.  Digging further.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776002#comment-16776002
 ] 

Karl Wright commented on LUCENE-8696:
-

Hmm, even when I use createSurfacePoint() with this point, it still fails.  So 
I need to look deeper.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775839#comment-16775839
 ] 

Karl Wright commented on LUCENE-8696:
-

Preliminary results indicate that the problem may be due to the fact that the 
point isn't on the surface.  The following test fails:

{code}
GeoPoint check = new GeoPoint(0.02071783020158524, 0.9523290535474472, 
0.30699177256064203);
assertTrue(PlanetModel.WGS84.pointOnSurface(check));
{code}

Because path geometry uses surface circles and parallel slicing planes, paths 
can be particularly susceptible to misconstruing membership for points that are 
off the world.  I'll try to confirm this picture.
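
To make "on the surface" concrete, here is a self-contained sketch of the check, assuming the world is the WGS84 ellipsoid scaled by its mean radius and that a point counts as on the surface when it satisfies the ellipsoid equation within a small tolerance.  The class SurfaceCheck, the tolerance EPS, and the projection helper are hypothetical illustrations, not the PlanetModel code.

```java
/** Hypothetical sketch: ellipsoid membership and projection onto the surface. */
final class SurfaceCheck {
    // Assumed scaling: WGS84 radii in meters divided by the mean radius.
    static final double AB = 6378137.0 / 6371008.7714;      // equatorial radius
    static final double C = 6356752.314245 / 6371008.7714;  // polar radius
    static final double EPS = 1e-12;                         // hypothetical tolerance

    /** Value of (x^2 + y^2)/ab^2 + z^2/c^2; it equals 1 exactly on the surface. */
    static double ellipsoidValue(double x, double y, double z) {
        return (x * x + y * y) / (AB * AB) + (z * z) / (C * C);
    }

    static boolean onSurface(double x, double y, double z) {
        return Math.abs(ellipsoidValue(x, y, z) - 1.0) < EPS;
    }

    /** Scales the point along the ray from the origin until it lands on the surface. */
    static double[] project(double x, double y, double z) {
        double t = 1.0 / Math.sqrt(ellipsoidValue(x, y, z));
        return new double[] {x * t, y * t, z * t};
    }

    public static void main(String[] args) {
        // The point from the failing assertion, projected before it is checked.
        double[] p = project(0.02071783020158524, 0.9523290535474472, 0.30699177256064203);
        System.out.println("projected point on surface? " + onSurface(p[0], p[1], p[2]));
    }
}
```

Projecting first removes one variable: if membership still differs for a point that satisfies the ellipsoid equation, the cause lies elsewhere.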
 




[jira] [Commented] (CONNECTORS-1587) Unable to Crawl Documents Meta data

2019-02-22 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775088#comment-16775088
 ] 

Karl Wright commented on CONNECTORS-1587:
-

Can you amend your ticket to tell us what connectors you are using for your 
job?  This ticket is very nearly incomprehensible, and unless it is amended I 
will close it on that basis.


> Unable to Crawl Documents Meta data
> ---
>
> Key: CONNECTORS-1587
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1587
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>
> I tried to crawl the metadata of the Document section, but was not able to 
> crawl the data.
> I am getting an error stating that "The query cannot be completed because the 
> number of lookup columns it contains exceeds the lookup column threshold 
> enforced by the administrator."
> How can I resolve this issue? Is any configuration needed for that?
> Please assist.
> The documentation shows the metadata contents as a drop-down type, but my 
> connector (Manifold 2.9.1) is different. Is there a version update for this 
> lookup?
> Thanks.





[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-22 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775050#comment-16775050
 ] 

Karl Wright commented on LUCENE-8696:
-

The path in the test retraces its steps, but that should not be a problem for 
membership testing.  I'll look into it starting this evening.





[jira] [Commented] (CONNECTORS-1584) regex documentation

2019-02-21 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774309#comment-16774309
 ] 

Karl Wright commented on CONNECTORS-1584:
-

Have you subscribed to the list?  Instructions are in the documentation for 
"contact us".  You send mail to:

user-subscr...@manifoldcf.apache.org



> regex documentation
> ---
>
> Key: CONNECTORS-1584
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1584
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Minor
>
> What type of regexes do the manifold include and exclude rules support, and 
> what regex support is there in general?
> At the moment I'm using a web repository connection and an Elastic output 
> connection.
>  I'm trying to exclude URLs that link to documents.
>           e.g. website.com/document/path/this.pdf and 
> website.com/document/path/other.PDF
> The issue I'm having is that the regex I have found so far doesn't work 
> case-insensitively, so for every possible case I have to add a new line.
>     e.g.:
> {code:java}
> .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code}
> Is it possible to add documentation on what type of regex can be used, or 
> maybe a tool to test your regex and see if it is supported by manifold?
> I tried mailing this question to 
> [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail 
> address returns a failure notice.
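
For reference, a minimal sketch of how a single case-insensitive expression would cover all of those variants, assuming the exclusion rules are evaluated with java.util.regex (this thread does not confirm which engine the web connector uses).  The class PdfUrlFilter and its method are hypothetical names for illustration only.

```java
import java.util.regex.Pattern;

/** Hypothetical helper showing one case-insensitive pattern for PDF URLs. */
final class PdfUrlFilter {
    // (?i) switches on case-insensitive matching for the whole expression;
    // \\. matches a literal dot, whereas the bare "." in ".*.pdf$" matches any character.
    static final Pattern PDF_URL = Pattern.compile("(?i).*\\.pdf$");

    static boolean isPdfUrl(String url) {
        return PDF_URL.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isPdfUrl("website.com/document/path/this.pdf"));
        System.out.println(isPdfUrl("website.com/document/path/other.PDF"));
        System.out.println(isPdfUrl("website.com/document/path/page.html"));
    }
}
```

If the engine is java.util.regex, the inline flag avoids adding one line per capitalization.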





[jira] [Commented] (CONNECTORS-1584) regex documentation

2019-02-21 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774105#comment-16774105
 ] 

Karl Wright commented on CONNECTORS-1584:
-

Actually, it *is* user@ but so many people get mixed up with that that I got it 
backwards myself.

What failure notice did you get when you mailed to user@?  I receive email from 
this list a dozen times a day or more so I am not sure why you'd be having 
trouble.




[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-02-20 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772995#comment-16772995
 ] 

Karl Wright commented on CONNECTORS-1563:
-

[~Subasini], we are trying to debug your setup.  The first principle of 
debugging is to identify where exactly the problem is occurring.  It eliminates 
one variable.  The file system connector is quite simple and has few 
configuration options, so it should be easy to set something up we can use to 
evaluate your solr connection.

Thanks,
Karl

> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, Manifold and Solr 
> settings_CustomField.docx, managed-schema, manifold settings.docx, 
> manifoldcf.log, path.png, schema.png, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and I am then 
> getting an error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the tika exception, my documents get indexed but don't have a 
> content field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence have not configured an external tika extractor 
> in the manifoldCF pipeline
> Please help me with this problem
> Thanks in advance





[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-02-20 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772912#comment-16772912
 ] 

Karl Wright commented on CONNECTORS-1563:
-

[~Subasini], the "error" is because it does not recognize a specific 
translation bundle for your language, so it defaults to English.  It is 
harmless.

I asked you to *try* working with a File System connection initially to narrow 
down where your problems were coming from.  Please do so.  [~shinichiro abe] 
and I both tried a configuration similar to the one you report at the end of 
last year, when we were debugging the 2.11 release of ManifoldCF.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-02-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772729#comment-16772729
 ] 

Karl Wright commented on CONNECTORS-1563:
-

In general, in cases like this, I recommend that people start with the simplest 
possible working configuration and then modify it until they achieve their 
goals.  In this case, that would mean starting with a file system job and a 
freshly-installed Solr instance, with no other changes whatsoever.

[~shinichiro abe], can you help Mr. Rath by trying MCF 2.12 with a fresh 
single-process Solr instance, using the "/update" handler?  He claims that this 
does not work and I do not have any time to work with him for the next few 
weeks.  If it works for you please provide detailed steps describing what you 
did.  Thanks in advance!





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771682#comment-16771682
 ] 

Karl Wright commented on LUCENE-8696:
-

[~ivera], would you be willing to construct a simple test case?  I can't 
possibly look at this until the weekend, but it would help.

> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-02-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771663#comment-16771663
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Hi Subasini,

Are you now Tika-extracting in ManifoldCF, or in Solr?
The text field looks like it contains properly extracted content, along with 
other stuff you do not want.  Is this correct?

If the extraction is happening in Solr, then I have no idea where this is 
coming from.  If the extraction is happening in ManifoldCF, and you have placed 
a Metadata Adjuster transformer in the pipeline between the Tika Extractor and 
the Solr Output Connector, I'd say you had set it up to concatenate many fields 
together into a text field.  The Metadata Adjuster has that ability.

The choice of how metadata (or content) fields get mapped to the Solr schema is 
set up in your Solr output connection configuration.  The Tika extraction 
basically replaces a binary input document with a character-sequence output 
document plus metadata fields.  The character-sequence output document must 
then be sent to Solr using the standard handler rather than the extracting 
update handler, so the handler should be changed from /update/extract to just 
/update, and the "Use extracting update handler" option should be turned off.  
The actual field name used for the extracted content body can also be changed, 
if desired, in the "Schema" part of the configuration.  But what is there by 
default works with Solr as it's set up by default.
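
To make the handler distinction concrete, here is a rough sketch (not 
ManifoldCF's own code) of what an already-extracted document looks like when it 
is posted to Solr's standard /update handler as JSON.  The base URL, the core 
name "mycore", the field name "content", and the helper name 
build_update_request are all assumptions made for illustration; the request is 
only built here, not sent.

```python
import json
from urllib.parse import urlencode


def build_update_request(base_url, core, doc_id, text, metadata=None,
                         content_field="content"):
    """Build (url, headers, body) for Solr's standard /update handler.

    The text is assumed to be extracted already (for example by a Tika
    step earlier in the pipeline), so it is sent as a plain field value
    rather than as a binary stream to /update/extract.
    """
    # One Solr document: an id, the extracted body, and any metadata fields.
    doc = {"id": doc_id, content_field: text}
    doc.update(metadata or {})

    # Hypothetical core name and commit policy; adjust for a real setup.
    url = f"{base_url.rstrip('/')}/{core}/update?" + urlencode({"commit": "true"})
    headers = {"Content-Type": "application/json"}
    # The JSON update format accepts a list of documents.
    body = json.dumps([doc]).encode("utf-8")
    return url, headers, body


url, headers, body = build_update_request(
    "http://localhost:8983/solr", "mycore",
    "file:/tmp/a.pdf", "Extracted text", {"author": "Example"},
)
```

The point of the sketch is only that /update receives text fields, whereas 
/update/extract expects the original binary and runs Tika on the Solr side.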





> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked "Use the Extract Update Handler:" param then I am getting an 
> error on Solr i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore tika exception, my documents get indexed but dont have content 
> field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence not configured external tika extractor in 
> manifoldCF pipeline
> Please help me with this problem
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1584) regex documentation

2019-02-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1584.
-
Resolution: Not A Problem

> regex documentation
> ---
>
> Key: CONNECTORS-1584
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1584
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Minor
>
> What type of regexes do ManifoldCF's include and exclude rules support, and 
> what regex support is there in general?
> At the moment I'm using a web repository connection and an Elastic output 
> connection.
>  I'm trying to exclude URLs that link to documents.
>           e.g. website.com/document/path/this.pdf and 
> website.com/document/path/other.PDF
> The issue I'm having is that the regex I have found so far is not 
> case-insensitive, so for every possible case I have to add a new line.
>     e.g.:
> {code:java}
> .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code}
> Is it possible to add documentation on what type of regex can be used, or 
> maybe a tool to test your regex and see if it is supported by ManifoldCF?
> I tried mailing this question to 
> [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail 
> address returns a failure notice.
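
As an illustration of the case-insensitivity question above, a single pattern 
with an inline flag can replace the per-case lines.  The sketch below uses 
Python's re module; it assumes (the thread does not confirm this) that 
ManifoldCF evaluates its include/exclude patterns with Java's java.util.regex, 
which accepts the same (?i) inline flag.  The helper name is_excluded is 
hypothetical.

```python
import re

# (?i) at the start makes the whole pattern case-insensitive.
# \. matches a literal dot; the unescaped "." in ".*.pdf$" matches any
# character, so it would also match a URL ending in "notapdf".
PDF_URL = re.compile(r"(?i).*\.pdf$")


def is_excluded(url):
    """Return True when the URL ends in .pdf, in any letter case."""
    return PDF_URL.search(url) is not None


examples = [
    "website.com/document/path/this.pdf",
    "website.com/document/path/other.PDF",
    "website.com/document/path/page.html",
]
results = {u: is_excluded(u) for u in examples}
```

Whether the exclusion field accepts this exact syntax should be checked against 
the connector's documentation before relying on it.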



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

