[jira] [Updated] (CONNECTORS-1522) Add SSL trust certificates list to ElasticSearch output connector

2018-08-09 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1522:

Fix Version/s: ManifoldCF 2.12

> Add SSL trust certificates list to ElasticSearch output connector
> -
>
> Key: CONNECTORS-1522
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1522
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
>
> Add "SSL trust certificate list" to Elasticsearch output connector.
> Add User Id, Password functionality to ES output connector.
> Above as per SOLR output connector.
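For context, here is a generic JSSE sketch of what a configurable trust-certificate
list plus user id/password support typically amounts to on the client side.  This
is not the connector's code; the class and method names are illustrative only, and
the real implementation would follow the SOLR output connector's existing
configuration model as the description suggests.

{code}
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.util.Base64;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class TrustedEsClientSketch {

  /** Build an SSLContext that trusts only the given certificate files (PEM/DER). */
  static SSLContext buildSslContext(List<String> certPaths) throws Exception {
    CertificateFactory cf = CertificateFactory.getInstance("X.509");
    KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
    trustStore.load(null, null);                       // empty, in-memory trust store
    int i = 0;
    for (String path : certPaths) {
      try (FileInputStream in = new FileInputStream(path)) {
        Certificate cert = cf.generateCertificate(in);
        trustStore.setCertificateEntry("trusted-" + (i++), cert);
      }
    }
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init(trustStore);
    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(null, tmf.getTrustManagers(), null);
    return ctx;
  }

  /** Value for an HTTP Basic "Authorization" header from a user id and password. */
  static String basicAuthHeader(String userId, String password) {
    String token = userId + ":" + password;
    return "Basic " + Base64.getEncoder()
        .encodeToString(token.getBytes(StandardCharsets.UTF_8));
  }
}
{code}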



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1522) Add SSL trust certificates list to ElasticSearch output connector

2018-08-09 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1522:
---

Assignee: Karl Wright

> Add SSL trust certificates list to ElasticSearch output connector
> -
>
> Key: CONNECTORS-1522
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1522
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Minor
>
> Add "SSL trust certificate list" to Elasticsearch output connector.
> Add User Id, Password functionality to ES output connector.
> Above as per SOLR output connector.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2693) Tika 1.17 uses the wrong classloader for reflection

2018-08-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574706#comment-16574706
 ] 

Karl Wright commented on TIKA-2693:
---

Re: testing: I don't have a test setup here, and the user is not at liberty to 
send me the documents that cause the problem.  He says he's been in contact 
with someone on the Tika team and has replaced all four Tika jars with ones 
from the nightly build, and I think he replaced the POI jars with 4.0.0 ones 
earlier, but I don't know if he retained that change, since that's not supposed 
to work 100%.  Anyway, if he responds to my inquiry I will ask him to build new 
jars from the branch that Tim created and see if they all work.


> Tika 1.17 uses the wrong classloader for reflection
> ---
>
> Key: TIKA-2693
> URL: https://issues.apache.org/jira/browse/TIKA-2693
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.17
>    Reporter: Karl Wright
>Priority: Major
>
> I don't know whether this was addressed in 1.18, but Tika seemingly uses the 
> wrong classloader when loading some classes by reflection.
> In ManifoldCF, there's a two-tiered classloader hierarchy.  Tika runs in the 
> higher class level.  Its expectation is that classes that are loaded via 
> reflection use the classloader associated with the class that is resolving 
> the reflection, NOT the thread classloader.  That's standard Java practice.
> But apparently there's a place where Tika doesn't do it that way:
> {code}
> Error tossed: org/apache/poi/POIXMLTextExtractor
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>  ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) 
> ~[?:?]
> at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>  ~[?:?]
> {code}
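As a neutral illustration of the distinction the description draws (this is not 
Tika's actual code), the two lookup strategies differ only in which classloader 
is handed to Class.forName:

{code}
// Illustrative only: contrasts the two reflection lookup strategies described above.
public class ReflectionLoadingSketch {

  /** Resolves against the classloader that loaded this class (the expected behavior). */
  static Class<?> loadViaOwnClassLoader(String className) throws ClassNotFoundException {
    return Class.forName(className, true, ReflectionLoadingSketch.class.getClassLoader());
  }

  /** Resolves against the thread context classloader; in a two-tiered hierarchy
      like ManifoldCF's, this may be a loader that cannot see the POI jars. */
  static Class<?> loadViaContextClassLoader(String className) throws ClassNotFoundException {
    return Class.forName(className, true, Thread.currentThread().getContextClassLoader());
  }
}
{code}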



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574676#comment-16574676
 ] 

Karl Wright commented on CONNECTORS-1490:
-

[~piergiorgioluc...@gmail.com], it ran correctly because you'd previously done 
a "mvn install" for ManifoldCF.


> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mcf-mongodb-connector(CONNECTORS-1490)1.patch, 
> mongoDB-connectors-IT-OK-from-Ant.txt, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: crawl interrupted

2018-08-09 Thread Karl Wright
There is no autovacuum for MySQL.  MySQL apparently does dead tuple cleanup
as it goes.

Karl

On Thu, Aug 9, 2018 at 6:13 AM Gustavo Beneitez 
wrote:

> Hi,
>
> looking at the manifoldCF pom I can see
>
> 1.0.4-SNAPSHOT
>
> I'm not aware of any change in the database; in fact ours is MySQL, and I
> don't know whether the "auto_vacuum" property is present in the MySQL
> installation.
>
> Thanks!
>
> El jue., 9 ago. 2018 a las 11:19, msaunier ()
> escribió:
>
>> Hi Gustavo,
>>
>>
>>
>> What is your ManifoldCF version?
>>
>> Have you disabled auto_vacuum in your SQL configuration?
>>
>>
>>
>> Maxence,
>>
>>
>>
>>
>>
>>
>>
>> *De :* Gustavo Beneitez [mailto:gustavo.benei...@gmail.com]
>> *Envoyé :* jeudi 9 août 2018 11:17
>> *À :* user@manifoldcf.apache.org
>> *Objet :* crawl interrupted
>>
>>
>>
>> Hi all,
>>
>>
>>
>> The ManifoldCF crawler just aborted its jobs and recorded a message in the
>> job status:
>>
>>
>>
>>  Error: Unexpected jobqueue status - record id 1533799203323, expecting
>> active status, saw 2
>>
>>
>>
>> Do you know what it means? Maybe I need to look at the GC or Catalina
>> logs.
>>
>>
>>
>> Thanks!
>>
>


[jira] [Assigned] (LUCENE-8451) GeoPolygon test failure

2018-08-09 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8451:
---

Assignee: Karl Wright

> GeoPolygon test failure
> ---
>
> Key: LUCENE-8451
> URL: https://issues.apache.org/jira/browse/LUCENE-8451
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>    Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8451.patch
>
>
>  [junit4] Suite: org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=RandomGeoPolygonTest -Dtests.method=testCompareBigPolygons 
> -Dtests.seed=2C88B3DA273BE2DF -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=en-TC -Dtests.timezone=Europe/Budapest -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>    [junit4] FAILURE 0.01s J0 | RandomGeoPolygonTest.testCompareBigPolygons 
> \{seed=[2C88B3DA273BE2DF:5742535E2813B1BD]} <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: Polygon failed to 
> build with an exception:
>    [junit4]    > [[lat=1.5408708232037775E-28, lon=0.0([X=1.0011188539924791, 
> Y=0.0, Z=1.5425948326762136E-28])], [lat=-0.42051952071345244, 
> lon=-0.043956709579662245([X=0.912503274975597, Y=-0.04013649525500056, 
> Z=-0.40846219882801177])], [lat=0.6967302798374987, 
> lon=-1.5354076311466454([X=0.027128243251908137, Y=-0.7662593106632875, 
> Z=0.641541793498374])], [lat=0.6093302043457702, 
> lon=-1.5374202165648532([X=0.02736481119831758, Y=-0.8195876964154789, 
> Z=0.5723273145651325])], [lat=1.790840712772793E-12, 
> lon=4.742872761198669E-13([X=1.0011188539924791, Y=4.748179343323357E-13, 
> Z=1.792844402054173E-12])], [lat=-1.4523595845716656E-12, 
> lon=9.592326932761353E-13([X=1.0011188539924791, Y=9.603059346047237E-13, 
> Z=-1.4539845628913788E-12])], [lat=0.29556330360208455, 
> lon=1.5414988021120735([X=0.02804645884597515, Y=0.957023986775941, 
> Z=0.2915213382500179])]]
>    [junit4]    > WKT:POLYGON((-2.5185339401969213 -24.093993739745027,0.0 
> 8.828539494442529E-27,5.495998489568957E-11 
> -8.321407453133E-11,2.7174659198424288E-11 
> 1.0260761462208114E-10,88.32137548549387 
> 16.934529875343248,-87.97237709688223 39.91970449365747,-88.0876897472551 
> 34.91204903885665,-2.5185339401969213 -24.093993739745027))
>    [junit4]    > java.lang.IllegalArgumentException: Convex polygon has a 
> side that is more than 180 degrees
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([2C88B3DA273BE2DF:5742535E2813B1BD]:0)
>    [junit4]    >        at 
> org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest.testComparePolygons(RandomGeoPolygonTest.java:163)
>    [junit4]    >        at 
> org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest.testCompareBigPolygons(RandomGeoPolygonTest.java:98)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    [junit4]    >        at 
> java.base/java.lang.reflect.Method.invoke(Method.java:564)
>    [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:844)
>    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=1403, maxMBSortInHeap=5.306984579448146, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=en-TC, 
> timezone=Europe/Budapest
>    [junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 9.0.4 
> (64-bit)/cpus=8,threads=1,free=296447064,total=536870912
>    [junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> GeoExactCircleTest, TestGeo3DDocValues, RandomGeoPolygonTest]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8451) GeoPolygon test failure

2018-08-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574582#comment-16574582
 ] 

Karl Wright commented on LUCENE-8451:
-

[~ivera], I won't have any possibility of looking at this until Saturday.

It looks like a classic tiling problem.  Not all selections of polygon points 
can be tiled.  Perhaps the edges cross?
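For anyone who wants to poke at this outside the randomized test, here is a 
rough sketch that rebuilds the failing polygon from the (lat, lon) radians in 
the failure dump quoted below.  The choice of PlanetModel.WGS84 and the 
GeoPolygonFactory call are assumptions about how the test constructs its 
polygons, not a statement of what it actually does:

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.spatial3d.geom.GeoPoint;
import org.apache.lucene.spatial3d.geom.GeoPolygon;
import org.apache.lucene.spatial3d.geom.GeoPolygonFactory;
import org.apache.lucene.spatial3d.geom.PlanetModel;

public class Lucene8451Repro {
  public static void main(String[] args) {
    // (lat, lon) pairs in radians, copied from the failure dump.
    double[][] latLon = {
      {1.5408708232037775E-28, 0.0},
      {-0.42051952071345244, -0.043956709579662245},
      {0.6967302798374987, -1.5354076311466454},
      {0.6093302043457702, -1.5374202165648532},
      {1.790840712772793E-12, 4.742872761198669E-13},
      {-1.4523595845716656E-12, 9.592326932761353E-13},
      {0.29556330360208455, 1.5414988021120735}
    };
    List<GeoPoint> points = new ArrayList<>();
    for (double[] p : latLon) {
      points.add(new GeoPoint(PlanetModel.WGS84, p[0], p[1]));
    }
    try {
      GeoPolygon polygon = GeoPolygonFactory.makeGeoPolygon(PlanetModel.WGS84, points);
      System.out.println("Built: " + polygon);
    } catch (IllegalArgumentException e) {
      // Expected to reproduce: "Convex polygon has a side that is more than 180 degrees"
      System.out.println("Build failed: " + e.getMessage());
    }
  }
}
{code}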


> GeoPolygon test failure
> ---
>
> Key: LUCENE-8451
> URL: https://issues.apache.org/jira/browse/LUCENE-8451
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8451.patch
>
>
>  [junit4] Suite: org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=RandomGeoPolygonTest -Dtests.method=testCompareBigPolygons 
> -Dtests.seed=2C88B3DA273BE2DF -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=en-TC -Dtests.timezone=Europe/Budapest -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>    [junit4] FAILURE 0.01s J0 | RandomGeoPolygonTest.testCompareBigPolygons 
> \{seed=[2C88B3DA273BE2DF:5742535E2813B1BD]} <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: Polygon failed to 
> build with an exception:
>    [junit4]    > [[lat=1.5408708232037775E-28, lon=0.0([X=1.0011188539924791, 
> Y=0.0, Z=1.5425948326762136E-28])], [lat=-0.42051952071345244, 
> lon=-0.043956709579662245([X=0.912503274975597, Y=-0.04013649525500056, 
> Z=-0.40846219882801177])], [lat=0.6967302798374987, 
> lon=-1.5354076311466454([X=0.027128243251908137, Y=-0.7662593106632875, 
> Z=0.641541793498374])], [lat=0.6093302043457702, 
> lon=-1.5374202165648532([X=0.02736481119831758, Y=-0.8195876964154789, 
> Z=0.5723273145651325])], [lat=1.790840712772793E-12, 
> lon=4.742872761198669E-13([X=1.0011188539924791, Y=4.748179343323357E-13, 
> Z=1.792844402054173E-12])], [lat=-1.4523595845716656E-12, 
> lon=9.592326932761353E-13([X=1.0011188539924791, Y=9.603059346047237E-13, 
> Z=-1.4539845628913788E-12])], [lat=0.29556330360208455, 
> lon=1.5414988021120735([X=0.02804645884597515, Y=0.957023986775941, 
> Z=0.2915213382500179])]]
>    [junit4]    > WKT:POLYGON((-2.5185339401969213 -24.093993739745027,0.0 
> 8.828539494442529E-27,5.495998489568957E-11 
> -8.321407453133E-11,2.7174659198424288E-11 
> 1.0260761462208114E-10,88.32137548549387 
> 16.934529875343248,-87.97237709688223 39.91970449365747,-88.0876897472551 
> 34.91204903885665,-2.5185339401969213 -24.093993739745027))
>    [junit4]    > java.lang.IllegalArgumentException: Convex polygon has a 
> side that is more than 180 degrees
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([2C88B3DA273BE2DF:5742535E2813B1BD]:0)
>    [junit4]    >        at 
> org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest.testComparePolygons(RandomGeoPolygonTest.java:163)
>    [junit4]    >        at 
> org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest.testCompareBigPolygons(RandomGeoPolygonTest.java:98)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    [junit4]    >        at 
> java.base/java.lang.reflect.Method.invoke(Method.java:564)
>    [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:844)
>    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=1403, maxMBSortInHeap=5.306984579448146, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=en-TC, 
> timezone=Europe/Budapest
>    [junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 9.0.4 
> (64-bit)/cpus=8,threads=1,free=296447064,total=536870912
>    [junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> GeoExactCircleTest, TestGeo3DDocValues, RandomGeoPolygonTest]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574570#comment-16574570
 ] 

Karl Wright commented on CONNECTORS-1490:
-

Hi [~piergiorgioluc...@gmail.com], we have to rethink this.

I executed the following steps:

{code}
ant make-core-deps make-deps
ant test
{code}

This fails because of the following:

{code}
 [exec] [ERROR] Failed to execute goal on project mcf-mongodb-connector: 
Could not resolve dependencies for project 
org.apache.manifoldcf:mcf-mongodb-connector:jar:2.11-SNAPSHOT: The following 
artifacts could not be resolved: 
org.apache.manifoldcf:mcf-core:jar:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-connector-common:jar:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-agents:jar:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-pull-agent:jar:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-ui-core:jar:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-core:jar:tests:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-agents:jar:tests:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-pull-agent:jar:tests:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-api-service:war:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-authority-service:war:2.11-SNAPSHOT, 
org.apache.manifoldcf:mcf-crawler-ui:war:2.11-SNAPSHOT: Could not find artifact 
org.apache.manifoldcf:mcf-core:jar:2.11-SNAPSHOT in sonatype-repo 
(http://oss.sonatype.org/content/repositories/snapshots) -> [Help 1]
{code}

This is obviously because it's still shelling out to Maven, and it's expecting 
the Maven build to have been run first.  We cannot ensure that, and committing 
a native Ant build seems unreasonable because there are literally hundreds of 
dependencies MongoDB brings in for testing that we'd all have to download via 
Ant.

So it seems to me there are two choices.  The first choice is to simply not run 
any MongoDB integration tests under Ant, and only run them under Maven.  The 
second choice is to revamp the ManifoldCF Ant build to use Ivy instead of 
manual dependency resolution.  The second approach is problematic too, though, 
because we'd still be distributing a much, much larger lib distribution.  I 
don't know how much larger.  We'd also need to figure out how to build a lib 
distribution, since we'd effectively be replacing the "lib" directory with Ivy 
support.

For now I therefore think the only possibility is disabling the MongoDB 
integration tests under Ant.  Can you do that?





> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mcf-mongodb-connector(CONNECTORS-1490)1.patch, 
> mongoDB-connectors-IT-OK-from-Ant.txt, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574552#comment-16574552
 ] 

Karl Wright commented on CONNECTORS-1490:
-

Ok, thanks.  I'm going to try running the ITs from Ant here and see if they run 
natively.


> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mcf-mongodb-connector(CONNECTORS-1490)1.patch, 
> mongoDB-connectors-IT-OK-from-Ant.txt, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2693) Tika 1.17 uses the wrong classloader for reflection

2018-08-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573935#comment-16573935
 ] 

Karl Wright commented on TIKA-2693:
---

I am currently with my wife in the emergency room, so trying things out will 
take a while.

We have agreed to hold our release until this code ships.  For clients that 
need the fix before that I am happy to build from your branch.  

> Tika 1.17 uses the wrong classloader for reflection
> ---
>
> Key: TIKA-2693
> URL: https://issues.apache.org/jira/browse/TIKA-2693
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.17
>    Reporter: Karl Wright
>Priority: Major
>
> I don't know whether this was addressed in 1.18, but Tika seemingly uses the 
> wrong classloader when loading some classes by reflection.
> In ManifoldCF, there's a two-tiered classloader hierarchy.  Tika runs in the 
> higher class level.  Its expectation is that classes that are loaded via 
> reflection use the classloader associated with the class that is resolving 
> the reflection, NOT the thread classloader.  That's standard Java practice.
> But apparently there's a place where Tika doesn't do it that way:
> {code}
> Error tossed: org/apache/poi/POIXMLTextExtractor
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>  ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) 
> ~[?:?]
> at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>  ~[?:?]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: It's release time again

2018-08-08 Thread Karl Wright
It seems that Tika 1.19 fixes many of the Tika-related issues and is due in
a couple of weeks.  I'm therefore going to recommend holding off until we
get 1.19, or until we hear that it's out of the question, before we do a
ManifoldCF release, since many of our outstanding issues relate to Tika and
Apache POI.

Thoughts?

Karl


On Tue, Aug 7, 2018 at 8:27 AM Piergiorgio Lucidi 
wrote:

> It's ok for me,
>
> I wanted to finish several things but unfortunately I will not have enough
> time.
> The MongoDB Output Connector implemented by Irindu will be merged into our
> branch during this week, but I don't know if we have time to bring it
> directly into this release.
>
> In September I hope to bring the new website and Alfresco BFSI and then the
> Azure Storage connectors.
>
> Cheers,
> PJ
>
> Il giorno mar 7 ago 2018 alle ore 14:05 Karl Wright 
> ha
> scritto:
>
> > When will it be ready for integration?
> > Karl
> >
> > On Tue, Aug 7, 2018 at 7:10 AM Irindu Nugawela 
> > wrote:
> >
> > > Hi Karl,
> > >
> > > I am currently preparing the patch for mcf-mongodb-output-connector. I
> > > would be glad if we can include it in the next release.
> > >
> > > On Mon, 6 Aug 2018 at 16:00, Karl Wright  wrote:
> > >
> > > > I'm hoping to cut RC0 of 2.11 around August 15th. Any objection?
> > > >
> > > > Karl
> > > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Irindu Nugawela,
> > > Computer Engineering <http://www.ce.pdn.ac.lk/> Undergraduate,
> > > Faculty of Engineering University of Peradeniya
> > >
> >
> > --
> > Piergiorgio
> >
>


[jira] [Commented] (TIKA-2693) Tika 1.17 uses the wrong classloader for reflection

2018-08-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573198#comment-16573198
 ] 

Karl Wright commented on TIKA-2693:
---

I am being clobbered with Tika/POI issues at the moment so I'm going to 
recommend holding off shipping MCF 2.11 until we either have a 1.19 Tika, or we 
know we're not going to get it soon.  Hopefully that helps. ;-)

> Tika 1.17 uses the wrong classloader for reflection
> ---
>
> Key: TIKA-2693
> URL: https://issues.apache.org/jira/browse/TIKA-2693
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.17
>    Reporter: Karl Wright
>Priority: Major
>
> I don't know whether this was addressed in 1.18, but Tika seemingly uses the 
> wrong classloader when loading some classes by reflection.
> In ManifoldCF, there's a two-tiered classloader hierarchy.  Tika runs in the 
> higher class level.  Its expectation is that classes that are loaded via 
> reflection use the classloader associated with the class that is resolving 
> the reflection, NOT the thread classloader.  That's standard Java practice.
> But apparently there's a place where Tika doesn't do it that way:
> {code}
> Error tossed: org/apache/poi/POIXMLTextExtractor
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>  ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) 
> ~[?:?]
> at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>  ~[?:?]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Job stuck internal http error 500

2018-08-08 Thread Karl Wright
Thanks for the update!

Did the Tika people say when 1.19 will be released?

Karl


On Wed, Aug 8, 2018 at 8:29 AM Bisonti Mario 
wrote:

> Hallo
>
> You were right, Karl.
>
>
>
> I was helped by the Tika people and they patched the Tika jar of the
> Solr installation, and the problem was solved!
>
>
>
> Now I have solved it using the Tika 1.19 nightly build.
>
>
>
>
>
> Thanks a lot.
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* venerdì 27 luglio 2018 12:39
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck internal http error 500
>
>
>
> I am afraid you will need to open a Tika ticket, and be prepared to attach
> your file to it.
>
>
>
> Thanks,
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jul 27, 2018 at 6:04 AM Bisonti Mario 
> wrote:
>
> It isn't a memory problem, because bigger xls files (30 MB) have been
> processed.
>
>
>
> This xlsm file, with many colors etc., hangs.
>
> I suppose that it is a Tika/Solr error, but I don't know how to solve
> it.
>
> ☹
>
>
>
> *Oggetto:* R: Job stuck internal http error 500
>
>
>
> Yes, I am using:
> /opt/manifoldcf/multiprocess-file-example-proprietary
> I set:
>
> sudo nano options.env.unix
>
> -Xms2048m
>
> -Xmx2048m
>
>
>
> But I obtain the same error.
>
> My suspicion is that it could be a Solr/Tika problem.
>
> What could I do?
>
> I restricted the scan to a single file and obtained the same error.
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* venerdì 27 luglio 2018 11:36
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck internal http error 500
>
>
>
> I am presuming you are using the examples.  If so, edit the options file
> to grant more memory to your agents process by increasing the Xmx value.
>
>
>
> Karl
>
>
>
> On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
> My job is stuck indexing an xlsx file of 38 MB.
>
>
>
> What could I do to solve my problem?
>
>
>
> In the following there is the error:
> 2018-07-27 08:55:15.562 WARN  (qtp1521083627-52) [   x:core_share]
> o.e.j.s.HttpChannel /solr/core_share/update/extract
>
> java.lang.OutOfMemoryError
>
> at
> java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:188)
>
> at
> java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:180)
>
> at
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:147)
>
> at
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:660)
>
> at java.base/java.lang.StringBuilder.append(StringBuilder.java:195)
>
> at
> org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:302)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>
> at
> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>
> at
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>
> at
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLTikaBodyPartHandler.run(OOXMLTikaBodyPartHandler.java:147)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.handleEndOfRun(OOXMLWordAndPowerPointTextHandler.java:468)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.endElement(OOXMLWordAndPowerPointTextHandler.java:450)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> a

[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573002#comment-16573002
 ] 

Karl Wright commented on CONNECTORS-1521:
-

There is one hacky approach that would certainly work, which is to deduct 24 
hours from the actual time the MCF framework gives the connector.  This 
effectively assumes we don't know what timezone the DCTM server is in, and 
will re-examine all documents from the last 24 hours accordingly.  It's an ugly 
hack and it will mean processing far more documents than is efficient, but I 
can see no other way, unless there is somebody who can give us a DQL query 
clause that will properly take the timezone into account.
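A minimal sketch of that hack, using illustrative names rather than the 
connector's actual variables, and assuming the usual mm/dd/yyyy DQL date 
pattern:

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class DctmCutoffSketch {

  /** Format an end-time cutoff for the r_modify_date clause, backed off by 24 hours
      to paper over an unknown timezone offset between MCF and the DCTM server. */
  static String endTimeClause(long frameworkTimeMillis) {
    long adjusted = frameworkTimeMillis - TimeUnit.HOURS.toMillis(24);
    SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss");
    return "r_modify_date<=date('" + fmt.format(new Date(adjusted))
        + "','mm/dd/yyyy hh:mi:ss')";
  }
}
{code}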

> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time 

[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572995#comment-16572995
 ] 

Karl Wright commented on CONNECTORS-1521:
-

I'm afraid I don't have time to even contemplate how something like that might 
be done.

No, there must be a way that people do this.  

At worst, you can make the timezone of MCF be the same as the timezone of the 
server.  But this connector was developed originally by a Documentum 
consultant, and it's his code, so I presume that he knew what he was doing.  
Maybe there actually isn't a way.

Do you have the ability to call support at Documentum?



> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noforma

[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572690#comment-16572690
 ] 

Karl Wright commented on CONNECTORS-1490:
-

The only remaining issue is how the tests are run.  They should be invoked 
directly using the ant build, not shelled out using Maven.

> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mcf-mongodb-connector(CONNECTORS-1490)1.patch, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572688#comment-16572688
 ] 

Karl Wright commented on CONNECTORS-1490:
-

ok, moved.


> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mcf-mongodb-connector(CONNECTORS-1490)1.patch, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571798#comment-16571798
 ] 

Karl Wright commented on CONNECTORS-1490:
-

Also, build.xml has the following:

{code}
{code}

This is leveraging (and requiring!) a Maven setup to run the actual IT tests 
for this connector.  That's not something we can support, since the Ant build 
is primary here.


> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1490) GSOC: MongoDB Output Connector

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571787#comment-16571787
 ] 

Karl Wright commented on CONNECTORS-1490:
-

What was the final decision about which version of the jar this connector 
should depend upon?


> GSOC: MongoDB Output Connector
> --
>
> Key: CONNECTORS-1490
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1490
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: MongoDB Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: MongoDB, gsoc2018, java, junit
> Attachments: mcf-mongodb-connector(CONNECTORS-1490).patch, 
> mongodb-output-connection-configuration.PNG
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to extend the Content Migration capabilities adding MongoDB / 
> GridFS as a new output connector for importing contents from one or more 
> repositories supported by ManifoldCF. In this way we will help developers on 
> migrating contents from different data sources on MongoDB.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write the connector implementation
>  * Implement unit tests
>  * Build all the integration tests for testing the connector inside the 
> framework
>  * Write the documentation for this connector
> We have a complete documentation on how to implement an Output Connector:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/writing-output-connectors.html]
> Take a look also at our book to understand better the framework and how to 
> implement connectors:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571629#comment-16571629
 ] 

Karl Wright commented on CONNECTORS-1521:
-

{quote}
As far as I can see none of the patterns gives anything related to time zone or 
locale, or mentions UTC, DST etc.
{quote}

If there's a pattern with a "Z" at the end or in the middle, that usually 
stands for "Zulu" and is UTC.  Are there any such patterns?
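For reference, a small sketch of what a "Zulu" (UTC) timestamp looks like from 
Java, next to the same instant rendered in UTC with a DQL-style layout; whether 
DQL itself accepts such a pattern is exactly the open question:

{code}
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ZuluFormatSketch {
  public static void main(String[] args) {
    Instant now = Instant.now();
    // ISO-8601 "Zulu" form, e.g. 2018-08-07T08:07:34Z
    System.out.println(DateTimeFormatter.ISO_INSTANT.format(now));
    // The same instant rendered in UTC with a DQL-style layout (pattern is illustrative).
    DateTimeFormatter dql =
        DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss").withZone(ZoneOffset.UTC);
    System.out.println(dql.format(now));
  }
}
{code}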


> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>    Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/ hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571624#comment-16571624
 ] 

Karl Wright commented on CONNECTORS-1521:
-

[~jamesthomas], computing a date relative to "now" is trivial but it's not 
clear how you form a DQL expression involving r_modify_date with this.

The StackOverflow article is about how to get DFC to return dates in forms that 
are parseable by C Sharp, which is not germane here either, because we want 
dates that are in a form we can use for comparison.
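
For illustration only, here is the "trivial" part as a minimal Java sketch, with the 
unresolved part left as a comment (the 24-hour interval is purely a placeholder):

{code}
// Computing a timestamp relative to "now" is straightforward:
long now = System.currentTimeMillis();
long cutoff = now - 24L * 60L * 60L * 1000L;   // e.g. 24 hours ago; placeholder interval
// The open question is how to render 'cutoff' as a DQL date literal so that the
// comparison against r_modify_date is not skewed by the ManifoldCF server's local
// time zone (for instance, whether DQL accepts a UTC or zone-qualified literal at all).
{code}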




> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/ hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: It's release time again

2018-08-07 Thread Karl Wright
When will it be ready for integration?
Karl

On Tue, Aug 7, 2018 at 7:10 AM Irindu Nugawela  wrote:

> Hi Karl,
>
> I am currently preparing the patch for mcf-mongodb-output-connector. I
> would be glad if we can include it in the next release.
>
> On Mon, 6 Aug 2018 at 16:00, Karl Wright  wrote:
>
> > I'm hoping to cut RC0 of 2.11 around August 15th. Any objection?
> >
> > Karl
> >
>
>
> --
> Thanks and Regards,
> Irindu Nugawela,
> Computer Engineering <http://www.ce.pdn.ac.lk/> Undergraduate,
> Faculty of Engineering University of Peradeniya
>


[jira] [Comment Edited] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571462#comment-16571462
 ] 

Karl Wright edited comment on CONNECTORS-1521 at 8/7/18 11:05 AM:
--

All I have access to indicates that IDfTime patterns are very iffy:

https://msroth.wordpress.com/2010/12/12/dftime-patterns/

... and I don't see any sign of UTC dates or timezones.




was (Author: kwri...@metacarta.com):
All I have access to indicates that IDfTime patterns are very iffy:

https://msroth.wordpress.com/2010/12/12/dftime-patterns/



> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:

[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571462#comment-16571462
 ] 

Karl Wright commented on CONNECTORS-1521:
-

All I have access to indicates that IDfTime patterns are very iffy:

https://msroth.wordpress.com/2010/12/12/dftime-patterns/



> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/ hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571444#comment-16571444
 ] 

Karl Wright commented on CONNECTORS-1521:
-

The method that is used to build the date string is as follows:

{code}
session.buildDateString(timevalue);
{code}

... where timevalue is a time in ms since epoch.

The buildDateString code is here, in DocumentumImpl:

{code}
  /** Build a DQL date string from a long timestamp */
  public String buildDateString(long timestamp)
    throws RemoteException
  {
    return "date('"+new DfTime(new Date(timestamp)).asString(IDfTime.DF_TIME_PATTERN44)+"','"+IDfTime.DF_TIME_PATTERN44+"')";
  }
{code}

The Date object created is time-zone free (because that's the way Date objects 
work).  The asString() method probably uses local time though.

I can't change this without access to DFC documentation -- specifically, I need 
to know what DfTime formats are accepted, and if any UTC formats are 
understood, or if there are formats with timezones.  I also need the name of 
the appropriate IDfTime constant. 
 [~jamesthomas], is this something you can research?
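
Purely as a sketch for discussion, not a proposed patch: assuming the Java pattern 
"MM/dd/yyyy HH:mm:ss" corresponds to the layout DF_TIME_PATTERN44 denotes (an 
assumption, not something verified against DFC documentation), a zone-aware variant 
might look like the following.  "UTC" is a placeholder; which zone, if any, the 
Content Server applies when parsing the literal is exactly what needs researching.

{code}
  /** Hypothetical zone-aware variant -- NOT the current connector code. */
  public String buildDateString(long timestamp)
    throws RemoteException
  {
    // Pattern assumed to match DF_TIME_PATTERN44's layout; unverified against DFC docs.
    java.text.SimpleDateFormat fmt = new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss");
    fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"));  // placeholder zone
    return "date('"+fmt.format(new java.util.Date(timestamp))+"','"+IDfTime.DF_TIME_PATTERN44+"')";
  }
{code}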


> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>    Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> 

[jira] [Assigned] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-08-07 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1521:
---

Assignee: Karl Wright

> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/ hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/ hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/ hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/ hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/ hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2018-08-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571312#comment-16571312
 ] 

Karl Wright commented on CONNECTORS-1492:
-

[~piergiorgioluc...@gmail.com], I suspect that we (the ManifoldCF PMC) cannot 
distribute Docker images that contain open-source software whose license is 
incompatible with the Apache License.  That would include Postgresql and JCIFS 
connectors, for what it's worth, and we therefore might as well not bother.

You could open a LEGAL ticket to verify this, but it seems pretty clear to me.

> GSOC: Add support for Docker
> 
>
> Key: CONNECTORS-1492
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1492
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: devops, docker, gsoc2018
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to adopt Docker to provide ready-to-use images with a 
> preconfigured architecture stack for ManifoldCF. This will include ManifoldCF 
> itself but also the related database, which can be MySQL, PostgreSQL and so on.
> This will help developers work with, and put into production, a complete 
> ManifoldCF installation.
> You will be involved in the development of the following tasks; you will 
> learn how to:
>  * Write Docker files
>  * Write Docker Compose files
>  * Implement unit tests
>  * Build all the integration tests
>  * Write the documentation for new component
> We have complete documentation about ManifoldCF:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/concepts.html]
> Take a look at our book to understand better the framework and how to extend 
> it in different ways:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
It is what is expected with multiple threads active at the same time.
Karl


On Mon, Aug 6, 2018 at 7:26 AM Standen Guy 
wrote:

> Hi Karl,
>
> I haven’t experienced any job aborts, so all seems OK in that respect.
>
> Is there anything I can do to reduce these errors in the first place, or
> is it just to be expected given the nature of the multiple worker threads
> and the query types issued by ManifoldCF?
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 06 August 2018 12:16
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> These are exactly the same kind of issue as the first "error" reported.
> They will be retried.  If they did not get retried, they would abort the
> job immediately.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 6, 2018 at 6:57 AM Standen Guy 
> wrote:
>
> Hi Karl,
>
> Thanks for the prompt response regarding the first error
> example.  Do you have a view as to the second error, i.e.
>
> “2018-08-03 15:52:42.855 BST [5272] ERROR:  could not serialize access
> due to concurrent update
>
> 2018-08-03 15:52:42.855 BST [5272] STATEMENT:  SELECT id,status,checktime
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>
> 2018-08-03 15:52:42.855 BST [7424] ERROR:  could not serialize access due
> to concurrent update
>
> 2018-08-03 15:52:42.855 BST [7424] STATEMENT:  SELECT id,status,checktime
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>
> 2018-08-03 15:52:42.855 BST [5716] ERROR:  could not serialize access due
> to concurrent update
>
> “
>
>
>
> These errors don’t suggest a retry may sort them out  - is this an issue?
>
>
>
> Many Thanks,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 06 August 2018 10:52
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> Ah, the following errors:
>
> >>>>>>
>
> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on
> identification as a pivot, during conflict in checking.
>
> 2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed if
> retried.
>
> <<<<<<
>
>
>
> ... occur because of concurrent transactions.  The transaction is indeed
> retried when this occurs, so unless your job aborts, you are fine.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 6, 2018 at 5:49 AM Karl Wright  wrote:
>
> What errors are these?  Please include them and I can let you know.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 6, 2018 at 4:50 AM Standen Guy 
> wrote:
>
> Thank you Karl and Steph,
>
>
>
> Steph, yes I don’t seem to have any issues with running the MCF jobs, but
> am concerned about the PostgreSQL errors. Do you ( or anyone else)  have a
> view on the errors I have seen in the PostgreSQL logs  - is this something
> you have seen with 10.4  and if so was it corrected by changing some
> settings?
>
>
>
> Best Regards
>
>
>
> Guy
>
>
>
> *From:* Steph van Schalkwyk [mailto:st...@remcam.net]
> *Sent:* 03 August 2018 23:21
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> I'm using 10.4 with no issues.
>
> One or two of the recommended settings for MCF have changed between 9.6
> and 10.
>
> Simple to resolve though.
>
> Steph
>
>
>
>
>
>
> On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright  wrote:
>
> Hi Guy,
>
>
>
> I use Postgresql 9.6 myself and have found no issues with it.  I don't
> know about v 10 however.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Aug 3, 2018 at 11:32 AM Standen Guy 
> wrote:
>
> Hi Karl/All,
>
>I am upgrading from MCF v2.6  supported by PostgreSQL v
> I am upgrading from MCF v2.6 supported by PostgreSQL v
> which version of PostgreSQL  will support  MCF v2.10? The  MCF v2.10 build
> and deployment instructions still suggest that PostgreSQL 9.3 is the latest
> tested version of PostgreSQL.  Given that PostgreSQL 9.3.x  is going end of
> life next month ( Sept 2018), is there a preferred newer version that
> should be used?
>
>
>
> As an experiment I have installed MCF 2.10  supported by PostgreSQL 10.4.
> From the outside all seems to work OK, but investigation of the PostgreSQL
> logs shows a lot of errors:
>

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
These are exactly the same kind of issue as the first "error" reported.
They will be retried.  If they did not get retried, they would abort the
job immediately.

Karl


On Mon, Aug 6, 2018 at 6:57 AM Standen Guy 
wrote:

> Hi Karl,
>
> Thanks for the prompt response regarding the first error
> example.  Do you have a view as to the second error, i.e.
>
> “2018-08-03 15:52:42.855 BST [5272] ERROR:  could not serialize access
> due to concurrent update
>
> 2018-08-03 15:52:42.855 BST [5272] STATEMENT:  SELECT id,status,checktime
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>
> 2018-08-03 15:52:42.855 BST [7424] ERROR:  could not serialize access due
> to concurrent update
>
> 2018-08-03 15:52:42.855 BST [7424] STATEMENT:  SELECT id,status,checktime
> FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>
> 2018-08-03 15:52:42.855 BST [5716] ERROR:  could not serialize access due
> to concurrent update
>
> “
>
>
>
> These errors don’t suggest a retry may sort them out  - is this an issue?
>
>
>
> Many Thanks,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 06 August 2018 10:52
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> Ah, the following errors:
>
> >>>>>>
>
> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on
> identification as a pivot, during conflict in checking.
>
> 2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed if
> retried.
>
> <<<<<<
>
>
>
> ... occur because of concurrent transactions.  The transaction is indeed
> retried when this occurs, so unless your job aborts, you are fine.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 6, 2018 at 5:49 AM Karl Wright  wrote:
>
> What errors are these?  Please include them and I can let you know.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Aug 6, 2018 at 4:50 AM Standen Guy 
> wrote:
>
> Thank you Karl and Steph,
>
>
>
> Steph, yes I don’t seem to have any issues with running the MCF jobs, but
> am concerned about the PostgreSQL errors. Do you ( or anyone else)  have a
> view on the errors I have seen in the PostgreSQL logs  - is this something
> you have seen with 10.4  and if so was it corrected by changing some
> settings?
>
>
>
> Best Regards
>
>
>
> Guy
>
>
>
> *From:* Steph van Schalkwyk [mailto:st...@remcam.net]
> *Sent:* 03 August 2018 23:21
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> I'm using 10.4 with no issues.
>
> One or two of the recommended settings for MCF have changed between 9.6
> and 10.
>
> Simple to resolve though.
>
> Steph
>
>
>
>
>
>
> On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright  wrote:
>
> Hi Guy,
>
>
>
> I use Postgresql 9.6 myself and have found no issues with it.  I don't
> know about v 10 however.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Aug 3, 2018 at 11:32 AM Standen Guy 
> wrote:
>
> Hi Karl/All,
>
> I am upgrading from MCF v2.6 supported by PostgreSQL v
> 9.3.16   to  MCF v2.10.  I wonder if there is any official advice as to
> which version of PostgreSQL  will support  MCF v2.10? The  MCF v2.10 build
> and deployment instructions still suggest that PostgreSQL 9.3 is the latest
> tested version of PostgreSQL.  Given that PostgreSQL 9.3.x  is going end of
> life next month ( Sept 2018), is there a preferred newer version that
> should be used?
>
>
>
> As an experiment I have installed MCF 2.10  supported by PostgreSQL 10.4.
> From the outside all seems to work OK, but investigation of the PostgreSQL
> logs shows a lot of errors:
>
>
>
> e.g.
>
> “2018-08-03 15:50:00.629 BST [7920] LOG:  database system was shut down at
> 2018-08-03 15:47:30 BST
>
> 2018-08-03 15:50:00.734 BST [6344] LOG:  database system is ready to
> accept connections
>
> 2018-08-03 15:52:11.140 BST [6460] WARNING:  there is already a
> transaction in progress
>
> 2018-08-03 15:52:11.219 BST [6460] WARNING:  there is no transaction in
> progress
>
> 2018-08-03 15:52:13.844 BST [5716] WARNING:  there is already a
> transaction in progress
>
> 2018-08-03 15:52:13.879 BST [5716] WARNING:  there is no transaction in
> progress
>
> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
> to read/write dependencies among tra

It's release time again

2018-08-06 Thread Karl Wright
I'm hoping to cut RC0 of 2.11 around August 15th. Any objection?

Karl


[jira] [Assigned] (LUCENE-8444) Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds

2018-08-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8444:
---

Assignee: Ignacio Vera  (was: Karl Wright)

> Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds
> --
>
> Key: LUCENE-8444
> URL: https://issues.apache.org/jira/browse/LUCENE-8444
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Nicholas Knize
>Assignee: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8444.patch
>
>
> Reproduces for me on branch_7x.  /cc [~daddywri]  [~ivera]
> {code:java}
> reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> {code}
> {code:java}
> [junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
>[junit4]   1> doc=639 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.077431832267001, 
> lon=3.141592653589793([X=-0.47288721079787505, Y=5.791198090613375E-17, 
> Z=-0.8794340737031547])]
>[junit4]   1>   quantized=[X=-0.47288721059145067, 
> Y=2.3309121299774915E-10, Z=-0.8794340734858216]
>[junit4]   1> doc=1079 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.074298280522397, 
> lon=-3.141592653589793([X=-0.4756448135017662, Y=-5.824968983859777E-17, 
> Z=-0.8779556514050441])]
>[junit4]   1>   quantized=[X=-0.4756448134355703, 
> Y=-2.3309121299774915E-10, Z=-0.8779556514433299]
>[junit4]   1>   shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, 
> number of shapes=1, address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]   1>   bounds=XYZBounds: [xmin=0.03288309626634029 
> xmax=1.0011188549924792 ymin=-1.0E-9 ymax=1.029686850221785E-9 
> zmin=-0.9972239136272725 zmax=0.12216699717265761]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.16s | TestGeo3DPoint.testGeo3DRelations <<<
>[junit4]> Throwable #1: java.lang.AssertionError: invalid bounds for 
> shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, number of shapes=1, 
> address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([252B55C41A78F987:955428509535571B]:0)
>[junit4]>  at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testGeo3DRelations(TestGeo3DPoint.java:259)
>[junit4]>  at java.lang.Thread.run(Thread.java:748)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), 
> sim=RandomSimilarity(queryNorm=false): {}, locale=th, timezone=America/Virgin
>[junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 
> 1.8.0_161 (64-bit)/cpus=4,threads=1,free=298939008,total=313524224
>[junit4]   2> NOTE: All tests run in this JVM: [TestGeo3DPoint]
>[junit4] Completed [1/1 (1!)] in 0.62s, 1 test, 1 failure <<< FAILURES!
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8444) Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds

2018-08-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570011#comment-16570011
 ] 

Karl Wright commented on LUCENE-8444:
-

[~ivera] That sounds like the proper fix then.  It's exactly what "functionally 
identical" was meant to capture.  I'll assign the ticket to you and you can 
commit the test and the fix.


> Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds
> --
>
> Key: LUCENE-8444
> URL: https://issues.apache.org/jira/browse/LUCENE-8444
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Nicholas Knize
>    Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8444.patch
>
>
> Reproduces for me on branch_7x.  /cc [~daddywri]  [~ivera]
> {code:java}
> reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> {code}
> {code:java}
> [junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
>[junit4]   1> doc=639 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.077431832267001, 
> lon=3.141592653589793([X=-0.47288721079787505, Y=5.791198090613375E-17, 
> Z=-0.8794340737031547])]
>[junit4]   1>   quantized=[X=-0.47288721059145067, 
> Y=2.3309121299774915E-10, Z=-0.8794340734858216]
>[junit4]   1> doc=1079 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.074298280522397, 
> lon=-3.141592653589793([X=-0.4756448135017662, Y=-5.824968983859777E-17, 
> Z=-0.8779556514050441])]
>[junit4]   1>   quantized=[X=-0.4756448134355703, 
> Y=-2.3309121299774915E-10, Z=-0.8779556514433299]
>[junit4]   1>   shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, 
> number of shapes=1, address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]   1>   bounds=XYZBounds: [xmin=0.03288309626634029 
> xmax=1.0011188549924792 ymin=-1.0E-9 ymax=1.029686850221785E-9 
> zmin=-0.9972239136272725 zmax=0.12216699717265761]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.16s | TestGeo3DPoint.testGeo3DRelations <<<
>[junit4]> Throwable #1: java.lang.AssertionError: invalid bounds for 
> shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, number of shapes=1, 
> address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([252B55C41A78F987:955428509535571B]:0)
>[junit4]>  at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testGeo3DRelations(TestGeo3DPoint.java:259)
>[junit4]>  at java.lang.Thread.run(Thread.java:748)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), 
> sim=RandomSimilarity(queryNorm=false): {}, locale=th, timezone=America/Virgin
>[junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 
> 1.8.0_161 (64-bit)/cpus=4,threads=1,free=298939008,total=313524224
>[junit4]   2> NOTE: All tests run in this JVM: [TestGeo3DPoint]
>[junit4] Compl

[jira] [Commented] (LUCENE-8444) Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds

2018-08-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569976#comment-16569976
 ] 

Karl Wright commented on LUCENE-8444:
-

[~ivera], identical cutoff planes are bad news.

If we detect such a condition we should throw the appropriate exception and let 
it look for another DualCrossing arrangement where that doesn't happen.  There 
are typically a half-dozen possibilities to choose from.  Can you be more 
specific about where the check is done?


> Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds
> --
>
> Key: LUCENE-8444
> URL: https://issues.apache.org/jira/browse/LUCENE-8444
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Nicholas Knize
>    Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8444.patch
>
>
> Reproduces for me on branch_7x.  /cc [~daddywri]  [~ivera]
> {code:java}
> reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> {code}
> {code:java}
> [junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
>[junit4]   1> doc=639 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.077431832267001, 
> lon=3.141592653589793([X=-0.47288721079787505, Y=5.791198090613375E-17, 
> Z=-0.8794340737031547])]
>[junit4]   1>   quantized=[X=-0.47288721059145067, 
> Y=2.3309121299774915E-10, Z=-0.8794340734858216]
>[junit4]   1> doc=1079 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.074298280522397, 
> lon=-3.141592653589793([X=-0.4756448135017662, Y=-5.824968983859777E-17, 
> Z=-0.8779556514050441])]
>[junit4]   1>   quantized=[X=-0.4756448134355703, 
> Y=-2.3309121299774915E-10, Z=-0.8779556514433299]
>[junit4]   1>   shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, 
> number of shapes=1, address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]   1>   bounds=XYZBounds: [xmin=0.03288309626634029 
> xmax=1.0011188549924792 ymin=-1.0E-9 ymax=1.029686850221785E-9 
> zmin=-0.9972239136272725 zmax=0.12216699717265761]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.16s | TestGeo3DPoint.testGeo3DRelations <<<
>[junit4]> Throwable #1: java.lang.AssertionError: invalid bounds for 
> shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, number of shapes=1, 
> address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([252B55C41A78F987:955428509535571B]:0)
>[junit4]>  at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testGeo3DRelations(TestGeo3DPoint.java:259)
>[junit4]>  at java.lang.Thread.run(Thread.java:748)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), 
> sim=RandomSimilarity(queryNorm=false): {}, locale=th, timezone=America/Virgin
>[junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 
> 1.

[jira] [Assigned] (LUCENE-8444) Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds

2018-08-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8444:
---

Assignee: Karl Wright

> Geo3D Test Failure: Test Point is Contained by shape but outside the XYZBounds
> --
>
> Key: LUCENE-8444
> URL: https://issues.apache.org/jira/browse/LUCENE-8444
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Nicholas Knize
>    Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8444.patch
>
>
> Reproduces for me on branch_7x.  /cc [~daddywri]  [~ivera]
> {code:java}
> reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> {code}
> {code:java}
> [junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
>[junit4]   1> doc=639 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.077431832267001, 
> lon=3.141592653589793([X=-0.47288721079787505, Y=5.791198090613375E-17, 
> Z=-0.8794340737031547])]
>[junit4]   1>   quantized=[X=-0.47288721059145067, 
> Y=2.3309121299774915E-10, Z=-0.8794340734858216]
>[junit4]   1> doc=1079 is contained by shape but is outside the 
> returned XYZBounds
>[junit4]   1>   unquantized=[lat=-1.074298280522397, 
> lon=-3.141592653589793([X=-0.4756448135017662, Y=-5.824968983859777E-17, 
> Z=-0.8779556514050441])]
>[junit4]   1>   quantized=[X=-0.4756448134355703, 
> Y=-2.3309121299774915E-10, Z=-0.8779556514433299]
>[junit4]   1>   shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, 
> number of shapes=1, address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]   1>   bounds=XYZBounds: [xmin=0.03288309626634029 
> xmax=1.0011188549924792 ymin=-1.0E-9 ymax=1.029686850221785E-9 
> zmin=-0.9972239136272725 zmax=0.12216699717265761]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeo3DPoint 
> -Dtests.method=testGeo3DRelations -Dtests.seed=252B55C41A78F987 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th 
> -Dtests.timezone=America/Virgin -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.16s | TestGeo3DPoint.testGeo3DRelations <<<
>[junit4]> Throwable #1: java.lang.AssertionError: invalid bounds for 
> shape=GeoComplexPolygon: {planetmodel=PlanetModel.WGS84, number of shapes=1, 
> address=5b34ab34, testPoint=[lat=-0.9074319066955279, 
> lon=2.1047077826887393E-11([X=0.6151745825332513, Y=1.2947627315700302E-11, 
> Z=-0.7871615107396388])], testPointInSet=true, shapes={ 
> {[lat=0.12234154783984401, lon=2.9773900430735544E-11([X=0.9935862314832985, 
> Y=2.9582937525533484E-11, Z=0.12216699617265761])], [lat=-1.1812619187738946, 
> lon=0.0([X=0.3790909950565304, Y=0.0, Z=-0.9234617794363308])], 
> [lat=-1.5378336326638269, lon=-2.177768668411E-97([X=0.03288309726634029, 
> Y=-7.161177895900688E-99, Z=-0.9972239126272725])]}}
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([252B55C41A78F987:955428509535571B]:0)
>[junit4]>  at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testGeo3DRelations(TestGeo3DPoint.java:259)
>[junit4]>  at java.lang.Thread.run(Thread.java:748)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), 
> sim=RandomSimilarity(queryNorm=false): {}, locale=th, timezone=America/Virgin
>[junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 
> 1.8.0_161 (64-bit)/cpus=4,threads=1,free=298939008,total=313524224
>[junit4]   2> NOTE: All tests run in this JVM: [TestGeo3DPoint]
>[junit4] Completed [1/1 (1!)] in 0.62s, 1 test, 1 failure <<< FAILURES!
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
Ah, the following errors:

>>>>>>

2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
to read/write dependencies among transactions

2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on
identification as a pivot, during conflict in checking.

2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed if
retried.

<<<<<<


... occur because of concurrent transactions.  The transaction is indeed
retried when this occurs, so unless your job aborts, you are fine.
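
For anyone curious what "retried" looks like in practice, here is a generic sketch --
not the actual ManifoldCF database layer, and the connection details and bind values
are placeholders -- of the standard handling for a PostgreSQL serialization failure,
which is reported with SQLSTATE 40001:

import java.sql.*;

public class SerializationRetryExample {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/dbname", "user", "password")) {
      conn.setAutoCommit(false);
      while (true) {
        try (PreparedStatement ps = conn.prepareStatement(
            "SELECT id,status,checktime FROM jobqueue WHERE dochash=? AND jobid=? FOR UPDATE")) {
          ps.setString(1, "somehash");    // placeholder bind values
          ps.setLong(2, 12345L);
          ps.executeQuery();
          // ... the rest of the transaction's work would go here ...
          conn.commit();
          break;                          // success: stop retrying
        } catch (SQLException e) {
          conn.rollback();
          if (!"40001".equals(e.getSQLState()))
            throw e;                      // not a serialization failure: rethrow
          // SQLSTATE 40001: already rolled back above; loop and retry the transaction
        }
      }
    }
  }
}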


Karl



On Mon, Aug 6, 2018 at 5:49 AM Karl Wright  wrote:

> What errors are these?  Please include them and I can let you know.
>
> Karl
>
>
> On Mon, Aug 6, 2018 at 4:50 AM Standen Guy 
> wrote:
>
>> Thank you Karl and Steph,
>>
>>
>>
>> Steph, yes I don’t seem to have any issues with running the MCF jobs, but
>> am concerned about the PostgreSQL errors. Do you ( or anyone else)  have a
>> view on the errors I have seen in the PostgreSQL logs  - is this something
>> you have seen with 10.4  and if so was it corrected by changing some
>> settings?
>>
>>
>>
>> Best Regards
>>
>>
>>
>> Guy
>>
>>
>>
>> *From:* Steph van Schalkwyk [mailto:st...@remcam.net]
>> *Sent:* 03 August 2018 23:21
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: PostgreSQL version to support MCF v2.10
>>
>>
>>
>> I'm using 10.4 with no issues.
>>
>> One or two of the recommended settings for MCF have changed between 9.6
>> and 10.
>>
>> Simple to resolve though.
>>
>> Steph
>>
>>
>>
>>
>>
>>
>> On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright  wrote:
>>
>> Hi Guy,
>>
>>
>>
>> I use Postgresql 9.6 myself and have found no issues with it.  I don't
>> know about v 10 however.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Fri, Aug 3, 2018 at 11:32 AM Standen Guy 
>> wrote:
>>
>> Hi Karl/All,
>>
>> I am upgrading from MCF v2.6 supported by PostgreSQL v
>> 9.3.16   to  MCF v2.10.  I wonder if there is any official advice as to
>> which version of PostgreSQL  will support  MCF v2.10? The  MCF v2.10 build
>> and deployment instructions still suggest that PostgreSQL 9.3 is the latest
>> tested version of PostgreSQL.  Given that PostgreSQL 9.3.x  is going end of
>> life next month ( Sept 2018), is there a preferred newer version that
>> should be used?
>>
>>
>>
>> As an experiment I have installed MCF 2.10  supported by PostgreSQL
>> 10.4.  From the outside all seems to work OK, but investigation of the
>> PostgreSQL  logs shows a lot of errors:
>>
>>
>>
>> e.g.
>>
>> “2018-08-03 15:50:00.629 BST [7920] LOG:  database system was shut down
>> at 2018-08-03 15:47:30 BST
>>
>> 2018-08-03 15:50:00.734 BST [6344] LOG:  database system is ready to
>> accept connections
>>
>> 2018-08-03 15:52:11.140 BST [6460] WARNING:  there is already a
>> transaction in progress
>>
>> 2018-08-03 15:52:11.219 BST [6460] WARNING:  there is no transaction in
>> progress
>>
>> 2018-08-03 15:52:13.844 BST [5716] WARNING:  there is already a
>> transaction in progress
>>
>> 2018-08-03 15:52:13.879 BST [5716] WARNING:  there is no transaction in
>> progress
>>
>> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
>> to read/write dependencies among transactions
>>
>> 2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on
>> identification as a pivot, during conflict in checking.
>>
>> 2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed
>> if retried.
>>
>> 2018-08-03 15:52:25.218 BST [4140] STATEMENT:  INSERT INTO jobqueue
>> (jobid,docpriority,checktime,docid,needpriority,dochash,id,checkaction,status)
>> VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9)
>>
>> 2018-08-03 15:52:25.219 BST [5800] ERROR:  could not serialize access due
>> to read/write dependencies among transactions
>>
>> 2018-08-03 15:52:25.219 BST [5800] DETAIL:  Reason code: Canceled on
>> identification as a pivot, during conflict in checking.
>>
>> 2018-08-03 15:52:25.219 BST [5800] HINT:  The transaction might succeed
>> if retried.
>>
>> 2018-08-03 15:52:25.219 BST [5800] STATEMENT:  INSERT INTO jobqueue
>> (jobid,docpriority,checktime,docid,needpriority,dochash,id,checkaction,status)
>> VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9)
>>
>> 2018-08-03

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
What errors are these?  Please include them and I can let you know.

Karl


On Mon, Aug 6, 2018 at 4:50 AM Standen Guy 
wrote:

> Thank you Karl and Steph,
>
>
>
> Steph, yes I don’t seem to have any issues with running the MCF jobs, but
> am concerned about the PostgreSQL errors. Do you ( or anyone else)  have a
> view on the errors I have seen in the PostgreSQL logs  - is this something
> you have seen with 10.4  and if so was it corrected by changing some
> settings?
>
>
>
> Best Regards
>
>
>
> Guy
>
>
>
> *From:* Steph van Schalkwyk [mailto:st...@remcam.net]
> *Sent:* 03 August 2018 23:21
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: PostgreSQL version to support MCF v2.10
>
>
>
> I'm using 10.4 with no issues.
>
> One or two of the recommended settings for MCF have changed between 9.6
> and 10.
>
> Simple to resolve though.
>
> Steph
>
>
>
>
>
>
> On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright  wrote:
>
> Hi Guy,
>
>
>
> I use Postgresql 9.6 myself and have found no issues with it.  I don't
> know about v 10 however.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Aug 3, 2018 at 11:32 AM Standen Guy 
> wrote:
>
> Hi Karl/All,
>
>I am upgrading from MCF v2.6  supported by PostgreSQL v
> 9.3.16   to  MCF v2.10.  I wonder if there is any official advice as to
> which version of PostgreSQL  will support  MCF v2.10? The  MCF v2.10 build
> and deployment instructions still suggest that PostgreSQL 9.3 is the latest
> tested version of PostgreSQL.  Given that PostgreSQL 9.3.x  is going end of
> life next month ( Sept 2018), is there a preferred newer version that
> should be used?
>
>
>
> As an experiment I have installed MCF 2.10  supported by PostgreSQL 10.4.
> From the outside all seems to work OK, but investigation of the PostgreSQL
> logs shows a lot of errors:
>
>
>
> e.g.
>
> “2018-08-03 15:50:00.629 BST [7920] LOG:  database system was shut down at
> 2018-08-03 15:47:30 BST
>
> 2018-08-03 15:50:00.734 BST [6344] LOG:  database system is ready to
> accept connections
>
> 2018-08-03 15:52:11.140 BST [6460] WARNING:  there is already a
> transaction in progress
>
> 2018-08-03 15:52:11.219 BST [6460] WARNING:  there is no transaction in
> progress
>
> 2018-08-03 15:52:13.844 BST [5716] WARNING:  there is already a
> transaction in progress
>
> 2018-08-03 15:52:13.879 BST [5716] WARNING:  there is no transaction in
> progress
>
> 2018-08-03 15:52:25.218 BST [4140] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:25.218 BST [4140] DETAIL:  Reason code: Canceled on
> identification as a pivot, during conflict in checking.
>
> 2018-08-03 15:52:25.218 BST [4140] HINT:  The transaction might succeed if
> retried.
>
> 2018-08-03 15:52:25.218 BST [4140] STATEMENT:  INSERT INTO jobqueue
> (jobid,docpriority,checktime,docid,needpriority,dochash,id,checkaction,status)
> VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9)
>
> 2018-08-03 15:52:25.219 BST [5800] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:25.219 BST [5800] DETAIL:  Reason code: Canceled on
> identification as a pivot, during conflict in checking.
>
> 2018-08-03 15:52:25.219 BST [5800] HINT:  The transaction might succeed if
> retried.
>
> 2018-08-03 15:52:25.219 BST [5800] STATEMENT:  INSERT INTO jobqueue
> (jobid,docpriority,checktime,docid,needpriority,dochash,id,checkaction,status)
> VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9)
>
> 2018-08-03 15:52:25.222 BST [5692] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:25.222 BST [5692] DETAIL:  Reason code: Canceled on
> identification as a pivot, during conflict in checking.
>
> 2018-08-03 15:52:25.222 BST [5692] HINT:  The transaction might succeed if
> retried.
>
> 2018-08-03 15:52:25.222 BST [5692] STATEMENT:  INSERT INTO jobqueue
> (jobid,docpriority,checktime,docid,needpriority,dochash,id,checkaction,status)
> VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9)
>
> 2018-08-03 15:52:28.149 BST [4140] ERROR:  could not serialize access due
> to read/write dependencies among transactions
>
> 2018-08-03 15:52:28.149 BST [4140] DETAIL:  Reason code: Canceled on
> identification as a pivot, during write.
>
> 2018-08-03 15:52:28.149 BST [4140] HINT:  The transaction might succeed if
> retried.
>
> 2018-08-03 15:52:28.149 BST [4140] STATEMENT:  UPDATE intrinsiclink SET
> processid=$1,isnew=$2 WHERE jobid=$3 AND parentidhash=$4 AND linktype=$5
> AND childidhash=$6
>
> 2018-08-03 15:52:28.261 BST [5156] ERRO

[jira] [Commented] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569769#comment-16569769
 ] 

Karl Wright commented on CONNECTORS-1517:
-

Attached a second patch, to be applied in addition to the first one.  Sorry 
about that!


> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1517-2.patch, CONNECTORS-1517.patch
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readibility):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1517:

Attachment: CONNECTORS-1517-2.patch

> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1517-2.patch, CONNECTORS-1517.patch
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1517.
-
Resolution: Fixed

tentative fix committed: r1837476


> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1517.patch
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569756#comment-16569756
 ] 

Karl Wright commented on CONNECTORS-1517:
-

[~jamesthomas], I've coded a tentative patch, but haven't been able to exercise 
it since I have no documentum setup here.  Please let me know if it works for 
you.


> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1517.patch
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1517:

Attachment: CONNECTORS-1517.patch

> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1517.patch
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-08-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569724#comment-16569724
 ] 

Karl Wright commented on CONNECTORS-1517:
-

That's unfortunate, because I don't know DQL either.  And I especially don't 
know the DQL for finding documents that are either missing their mime type 
field altogether or have one that's not listed.

I can certainly say this, though: the way the form posting works for this 
particular selection is a bit wonky, but it seems to guarantee that it works 
consistently PROVIDED the job is edited in the UI at least once.  If nothing 
has been selected at all, then the form is displayed with all boxes checked.  
The very first time the form is reposted or saved, all the checked boxes are 
gathered and become part of the mime spec.  So you'd think that if you wanted 
to achieve the original behavior, you just uncheck everything -- but that 
doesn't work because that's explicitly blocked and won't clear out the old 
stuff.

I think the first order of business is making the form work properly.  Then we 
can look at making changes.
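
For what it's worth, the usual way to let a checkbox form distinguish "never
posted" from "posted with nothing checked" is a hidden marker variable,
roughly as sketched below.  This is illustrative only; the variable and
method names are hypothetical, not the connector's actual form parameters:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: if the hidden marker is absent, the Content Types tab
// was never posted, so the previous specification is kept; if it is present
// and no boxes are checked, that genuinely means "no types selected".
public class ContentTypeFormSketch
{
  static List<String> selectedTypes(Map<String,String[]> post, List<String> previous)
  {
    if (!post.containsKey("contenttypes_present"))
      return previous;                      // tab never displayed/posted
    String[] checked = post.get("contenttype");
    if (checked == null)
      return Collections.emptyList();       // posted with nothing checked
    return Arrays.asList(checked);
  }
}
{code}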


> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list is treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8445) RandomGeoPolygonTest.testCompareBigPolygons() failure

2018-08-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569594#comment-16569594
 ] 

Karl Wright commented on LUCENE-8445:
-

It worries me that the detection of identical planes needs to be extended.  
Usually that's a sign that the math isn't quite right.  But the tolerance was 
already 1e12 times larger than it should have needed to be, so a factor of 5 
more seems not much worse.

I don't have the time right now to do a more detailed analysis, so I guess we 
can go ahead and commit the patch if all tests continue to pass.  But if other 
tests fail, we'll want to look in depth at what is actually going wrong.
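
For context, "identical plane" detection means comparing plane coefficients
within a numerical tolerance, along the lines of the sketch below (purely
illustrative, not the actual spatial3d Plane code), which is why the size of
that tolerance matters:

{code:java}
// Illustrative only: two planes A*x + B*y + C*z + D = 0, each stored as
// {A, B, C, D} with (A, B, C) normalized, are treated as numerically
// identical if all coefficients match within eps, either directly or with
// opposite sign (the same plane with flipped orientation).
public class PlaneIdentitySketch
{
  static boolean nearlyIdentical(double[] p1, double[] p2, double eps)
  {
    boolean same = true;
    boolean opposite = true;
    for (int i = 0; i < 4; i++)
    {
      if (Math.abs(p1[i] - p2[i]) > eps) same = false;
      if (Math.abs(p1[i] + p2[i]) > eps) opposite = false;
    }
    return same || opposite;
  }
}
{code}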


> RandomGeoPolygonTest.testCompareBigPolygons() failure
> -
>
> Key: LUCENE-8445
> URL: https://issues.apache.org/jira/browse/LUCENE-8445
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Steve Rowe
>Assignee: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8445.patch
>
>
> Failure from 
> [https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22590/], reproduces 
> for me on Java8:
> {noformat}
> Checking out Revision 2a41cbd192451f6e69ae2e6cccb7b2e26af2 
> (refs/remotes/origin/master)
> [...]
>[junit4] Suite: org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest
>[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=RandomGeoPolygonTest -Dtests.method=testCompareBigPolygons 
> -Dtests.seed=5444688174504C79 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=pt-LU -Dtests.timezone=Pacific/Pago_Pago -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>[junit4] FAILURE 0.23s J1 | RandomGeoPolygonTest.testCompareBigPolygons 
> {seed=[5444688174504C79:CC6BBA71B5FC82A6]} <<<
>[junit4]> Throwable #1: java.lang.AssertionError: 
>[junit4]> Standard polygon: GeoCompositePolygon: {[GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=-0.3036468642757333, 
> lon=0.5616500855257733([X=0.80765773219, Y=0.508219108660839, 
> Z=-0.29900221630132817])], [lat=-0.17226782498440368, 
> lon=0.8641157866087514([X=0.6397020656700857, Y=0.7492646151846353, 
> Z=-0.1714170458549729])], [lat=0.591763073597, 
> lon=1.0258877306398073([X=0.43020057589183536, Y=0.7097594028504629, 
> Z=0.5578252903622132])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])]], internalEdges={4}}, GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])]], internalEdges={0, 2}}, GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])], [lat=-1.0E-323, lon=0.0([X=1.0, Y=0.0, 
> Z=-1.0E-323])]], internalEdges={0}}]}
>[junit4]> Large polygon: GeoComplexPolygon: 
> {planetmodel=PlanetModel.SPHERE, number of shapes=1, address=e0a76761, 
> testPoint=[lat=0.04032281608974351, 
> lon=0.33067345007473165([X=0.945055084899262, Y=0.3244161494642355, 
> Z=0.040311889968686655])], testPointInSet=true, shapes={ 
> {[lat=1.0468214627857893E-8, lon=8.413079957136915E-7([X=0.6461, 
> Y=8.413079957135923E-7, Z=1.0468214627857893E-8])], [lat=-0.3036468642757333, 
> lon=0.5616500855257733([X=0.80765773219, Y=0.508219108660839, 
> Z=-0.29900221630132817])], [lat=-0.17226782498440368, 
> lon=0.8641157866087514([X=0.6397020656700857, Y=0.7492646151846353, 
> Z=-0.1714170458549729])], [lat=0.591763073597, 
> lon=1.0258877306398073([X=0.43020057589183536, Y=0.7097594028504629, 
> Z=0.5578252903622132])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])], [lat=-1.0E-323, l

[jira] [Assigned] (LUCENE-8445) RandomGeoPolygonTest.testCompareBigPolygons() failure

2018-08-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8445:
---

Assignee: Ignacio Vera

> RandomGeoPolygonTest.testCompareBigPolygons() failure
> -
>
> Key: LUCENE-8445
> URL: https://issues.apache.org/jira/browse/LUCENE-8445
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Steve Rowe
>Assignee: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8445.patch
>
>
> Failure from 
> [https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22590/], reproduces 
> for me on Java8:
> {noformat}
> Checking out Revision 2a41cbd192451f6e69ae2e6cccb7b2e26af2 
> (refs/remotes/origin/master)
> [...]
>[junit4] Suite: org.apache.lucene.spatial3d.geom.RandomGeoPolygonTest
>[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=RandomGeoPolygonTest -Dtests.method=testCompareBigPolygons 
> -Dtests.seed=5444688174504C79 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=pt-LU -Dtests.timezone=Pacific/Pago_Pago -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>[junit4] FAILURE 0.23s J1 | RandomGeoPolygonTest.testCompareBigPolygons 
> {seed=[5444688174504C79:CC6BBA71B5FC82A6]} <<<
>[junit4]> Throwable #1: java.lang.AssertionError: 
>[junit4]> Standard polygon: GeoCompositePolygon: {[GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=-0.3036468642757333, 
> lon=0.5616500855257733([X=0.80765773219, Y=0.508219108660839, 
> Z=-0.29900221630132817])], [lat=-0.17226782498440368, 
> lon=0.8641157866087514([X=0.6397020656700857, Y=0.7492646151846353, 
> Z=-0.1714170458549729])], [lat=0.591763073597, 
> lon=1.0258877306398073([X=0.43020057589183536, Y=0.7097594028504629, 
> Z=0.5578252903622132])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])]], internalEdges={4}}, GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])]], internalEdges={0, 2}}, GeoConvexPolygon: 
> {planetmodel=PlanetModel.SPHERE, points=[[lat=1.0468214627857893E-8, 
> lon=8.413079957136915E-7([X=0.6461, Y=8.413079957135923E-7, 
> Z=1.0468214627857893E-8])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])], [lat=-1.0E-323, lon=0.0([X=1.0, Y=0.0, 
> Z=-1.0E-323])]], internalEdges={0}}]}
>[junit4]> Large polygon: GeoComplexPolygon: 
> {planetmodel=PlanetModel.SPHERE, number of shapes=1, address=e0a76761, 
> testPoint=[lat=0.04032281608974351, 
> lon=0.33067345007473165([X=0.945055084899262, Y=0.3244161494642355, 
> Z=0.040311889968686655])], testPointInSet=true, shapes={ 
> {[lat=1.0468214627857893E-8, lon=8.413079957136915E-7([X=0.6461, 
> Y=8.413079957135923E-7, Z=1.0468214627857893E-8])], [lat=-0.3036468642757333, 
> lon=0.5616500855257733([X=0.80765773219, Y=0.508219108660839, 
> Z=-0.29900221630132817])], [lat=-0.17226782498440368, 
> lon=0.8641157866087514([X=0.6397020656700857, Y=0.7492646151846353, 
> Z=-0.1714170458549729])], [lat=0.591763073597, 
> lon=1.0258877306398073([X=0.43020057589183536, Y=0.7097594028504629, 
> Z=0.5578252903622132])], [lat=0.16341821264361944, 
> lon=0.04608724380526752([X=0.9856292512291138, Y=0.04545712432110151, 
> Z=0.16269182207472105])], [lat=1.5452567609928165E-12, 
> lon=5.5280224842135794E-12([X=1.0, Y=5.5280224842135794E-12, 
> Z=1.5452567609928165E-12])], [lat=-1.0E-323, lon=0.0([X=1.0, Y=0.0, 
> Z=-1.0E-323])]}}
>[junit4]> Point: [lat=-8.763997112262326E-13, 
> lon=3.14159265358979([X=-1.0, Y=3.2310891488651735E-15, 
> Z=-8.763997112262326E-13])]
>[junit4]> WKT: POLYGON((32.18017946378854 
> -17.397683785381247,49.51018758330871 -9.870219317504647,58.77903721991479 
> 33.90553510354402,2.640604559432277 9.363173880050821,3.1673235739886286E-10 
> 8.853669066894417E-11,0.0 -5.7E-322,4.820339742500488E-5 
> 5.99784517213369E-7,32.1

Re: Jetty crash

2018-07-31 Thread Karl Wright
If you are running on Unix, and you overcommit memory (that is, allocate
more memory than the machine actually has), the OS will randomly start
killing processes if it gets tight on memory.

Karl


On Tue, Jul 31, 2018 at 9:34 AM msaunier  wrote:

> I think a program killed Jetty.
>
>
>
> How can I debug this? Any idea? Does Jetty have a log file?
>
>
>
> Best regards,
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 31 July 2018 15:32
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Jetty crash
>
>
>
> There must be a reason.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jul 31, 2018 at 8:18 AM msaunier  wrote:
>
> Hello Karl,
>
>
>
> Today and yesterday, I have had an error with Jetty. Jetty crashes for no reason.
>
>
>
> Error:
>
> ./start.sh : ligne 41 :   562 Processus arrêté  "$JAVA_HOME/bin/java"
> $OPTIONS org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner
>
>
>
> Thanks,
>
> Maxence,
>
>
>
>
>
>


Re: Jetty crash

2018-07-31 Thread Karl Wright
There must be a reason.

Karl


On Tue, Jul 31, 2018 at 8:18 AM msaunier  wrote:

> Hello Karl,
>
>
>
> Today and yesterday, I have had an error with Jetty. Jetty crashes for no reason.
>
>
>
> Error:
>
> ./start.sh : ligne 41 :   562 Processus arrêté  "$JAVA_HOME/bin/java"
> $OPTIONS org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner
>
>
>
> Thanks,
>
> Maxence,
>
>
>
>
>


Re: Scheduler not working as we expected

2018-07-31 Thread Karl Wright
Hi Vinay,

Dynamic rescan is meant for web-crawling and revisits already crawled
documents based on how often they have changed in the past.  It is
therefore wholly inappropriate for something like a file crawl, since
directory contents (one of the kinds of documents there are in a file
crawl) change very infrequently.

Instead, I recommend that you run complete crawls, non-dynamic.  You can
even run minimal crawls fairly often, which will pick up new and changed
documents, and run non-minimal crawls on a less frequent schedule to
capture deletions.

Thanks,
Karl


On Tue, Jul 31, 2018 at 4:05 AM VINAY Bengaluru 
wrote:

> Hi Karl,
>We have set up a scheduler for our jobs with input
> connector as file system and output connector as Solr.
> We have set up a scheduler as follows :
> Schedule type: Rescan documents dynamically
> Recrawl interval: blank
> Schedule time: appropriate times with job invocation as complete.
>
> We see that the job is not picking up documents at the scheduled intervals.
>
> Why doesn't the job pick up new docs at the scheduled interval? Is anything
> wrong with our job configuration or our understanding?
>
> Thanks and regards,
> Vinay
>
>


[jira] [Commented] (TIKA-2693) Tika 1.17 uses the wrong classloader for reflection

2018-07-30 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562606#comment-16562606
 ] 

Karl Wright commented on TIKA-2693:
---

[~kiwiwings], when is Tika planning to go to POI 4.0.0?


> Tika 1.17 uses the wrong classloader for reflection
> ---
>
> Key: TIKA-2693
> URL: https://issues.apache.org/jira/browse/TIKA-2693
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.17
>    Reporter: Karl Wright
>Priority: Major
>
> I don't know whether this was addressed in 1.18, but Tika seemingly uses the 
> wrong classloader when loading some classes by reflection.
> In ManifoldCF, there's a two-tiered classloader hierarchy.  Tika runs in the 
> higher class level.  Its expectation is that classes that are loaded via 
> reflection use the classloader associated with the class that is resolving 
> the reflection, NOT the thread classloader.  That's standard Java practice.
> But apparently there's a place where Tika doesn't do it that way:
> {code}
> Error tossed: org/apache/poi/POIXMLTextExtractor
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>  ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) 
> ~[?:?]
> at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>  ~[?:?]
> {code}
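
A minimal illustration of the distinction being described (generic names, not
Tika's code): a class should resolve its reflective dependencies through the
loader that loaded it, not through the thread context loader, which in a
layered setup like MCF's may belong to a different tier and not see the same
classes (such as POI).

{code:java}
// Illustration only (not Tika's code).
public class LoaderSketch
{
  static Class<?> resolveWithOwnLoader(String name) throws ClassNotFoundException
  {
    // Uses the loader that loaded this class: stays within the same tier.
    return Class.forName(name, true, LoaderSketch.class.getClassLoader());
  }

  static Class<?> resolveWithThreadLoader(String name) throws ClassNotFoundException
  {
    // Uses whatever loader the calling thread happens to carry.
    return Class.forName(name, true, Thread.currentThread().getContextClassLoader());
  }
}
{code}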



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1520) Connector registration/deregistration fails when more than a certain number of jobs

2018-07-30 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1520:

Attachment: CONNECTORS-1520-2.patch

> Connector registration/deregistration fails when more than a certain number 
> of jobs
> ---
>
> Key: CONNECTORS-1520
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1520
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework agents process
>Affects Versions: ManifoldCF 2.10
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1520-2.patch, CONNECTORS-1520.patch
>
>
> Cut-and-paste error defeated limits on the number of jobs updated at one time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Ok, attached a second fix.

Karl


On Mon, Jul 30, 2018 at 4:09 PM Karl Wright  wrote:

> Yes, of course.  I overlooked that.  Will fix.
>
> Karl
>
>
> On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo  wrote:
>
>> That limit only applies to the list of transformations, not the list of
>> job IDs.  If you follow the code into the next method
>>
>> >>>>>>
>>   /** Note registration for a batch of transformation connection names.
>>   */
>>   protected void noteTransformationConnectionRegistration(List
>> list)
>> throws ManifoldCFException
>>   {
>> // Query for the matching jobs, and then for each job potentially
>> adjust the state
>> Long[] jobIDs = jobs.findJobsMatchingTransformations(list);
>> <<<<<<
>>
>> Even if "list" is only 1 item, findJobsMatchingTransformations may
>> return thousands of jobIDs, which is then passed to the query a few lines
>> later:
>>
>> >>>>>>
>>   query.append("SELECT
>> ").append(jobs.idField).append(",").append(jobs.statusField)
>>   .append(" FROM ").append(jobs.getTableName()).append(" WHERE ")
>>   .append(database.buildConjunctionClause(newList,new
>> ClauseDescription[]{
>> new MultiClause(jobs.idField,jobIDs)}))
>>   .append(" FOR UPDATE");
>> <<<<<<
>>
>> Which generates a query with a large OR clause
>>
>>
>> Mike
>>
>> On Mon, Jul 30, 2018 at 2:44 PM, Karl Wright  wrote:
>>
>>> The limit is applied in the method that calls
>>> noteTransformationConnectionRegistration.
>>>
>>> Here it is:
>>>
>>> >>>>>>
>>>   /** Note the registration of a transformation connector used by the
>>> specified connections.
>>>   * This method will be called when a connector is registered, on which
>>> the specified
>>>   * connections depend.
>>>   *@param connectionNames is the set of connection names.
>>>   */
>>>   @Override
>>>   public void noteTransformationConnectorRegistration(String[]
>>> connectionNames)
>>> throws ManifoldCFException
>>>   {
>>> // For each connection, find the corresponding list of jobs.  From
>>> these jobs, we want the job id and the status.
>>> List list = new ArrayList();
>>> int maxCount = database.findConjunctionClauseMax(new
>>> ClauseDescription[]{});
>>> int currentCount = 0;
>>> int i = 0;
>>> while (i < connectionNames.length)
>>> {
>>>   if (currentCount == maxCount)
>>>   {
>>> noteTransformationConnectionRegistration(list);
>>> list.clear();
>>> currentCount = 0;
>>>   }
>>>
>>>   list.add(connectionNames[i++]);
>>>   currentCount++;
>>> }
>>> if (currentCount > 0)
>>>   noteTransformationConnectionRegistration(list);
>>>   }
>>> <<<<<<
>>>
>>> It looks correct now.  Do you see an issue with it?
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jul 30, 2018 at 3:28 PM Mike Hugo  wrote:
>>>
>>>> Nice catch Karl!
>>>>
>>>> I applied that patch, but I'm still getting the same error.
>>>>
>>>> I think the problem is in JobManager.noteTransformationConnectionRegistration
>>>>
>>>> If jobs.findJobsMatchingTransformations(list); returns a large list of
>>>> ids (like it is doing in our case - 39,941 ids ), the generated query
>>>> string still has a large OR clause in it.  I don't see getMaxOrClause
>>>> applied to the query being built inside noteTransformationConnectionRegistration
>>>>
>>>> >>>>>>
>>>>  protected void noteTransformationConnectionRegistration(List
>>>> list)
>>>> throws ManifoldCFException
>>>>   {
>>>> // Query for the matching jobs, and then for each job potentially
>>>> adjust the state
>>>> Long[] jobIDs = jobs.findJobsMatchingTransformations(list);
>>>> if (jobIDs.length == 0)
>>>>   return;
>>>>
>>>> StringBuilder query = new StringBuilder();
>>>> ArrayList newList = new ArrayList();
>>>>
>>>> 

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Yes, of course.  I overlooked that.  Will fix.

Karl


On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo  wrote:

> That limit only applies to the list of transformations, not the list of
> job IDs.  If you follow the code into the next method
>
> >>>>>>
>   /** Note registration for a batch of transformation connection names.
>   */
>   protected void noteTransformationConnectionRegistration(List
> list)
> throws ManifoldCFException
>   {
> // Query for the matching jobs, and then for each job potentially
> adjust the state
> Long[] jobIDs = jobs.findJobsMatchingTransformations(list);
> <<<<<<
>
> Even if "list" is only 1 item, findJobsMatchingTransformations may return
> thousands of jobIDs, which is then passed to the query a few lines later:
>
> >>>>>>
>   query.append("SELECT
> ").append(jobs.idField).append(",").append(jobs.statusField)
>   .append(" FROM ").append(jobs.getTableName()).append(" WHERE ")
>   .append(database.buildConjunctionClause(newList,new
> ClauseDescription[]{
> new MultiClause(jobs.idField,jobIDs)}))
>   .append(" FOR UPDATE");
> <<<<<<
>
> Which generates a query with a large OR clause
>
>
> Mike
>
> On Mon, Jul 30, 2018 at 2:44 PM, Karl Wright  wrote:
>
>> The limit is applied in the method that calls
>> noteTransformationConnectionRegistration.
>>
>> Here it is:
>>
>> >>>>>>
>>   /** Note the registration of a transformation connector used by the
>> specified connections.
>>   * This method will be called when a connector is registered, on which
>> the specified
>>   * connections depend.
>>   *@param connectionNames is the set of connection names.
>>   */
>>   @Override
>>   public void noteTransformationConnectorRegistration(String[]
>> connectionNames)
>> throws ManifoldCFException
>>   {
>> // For each connection, find the corresponding list of jobs.  From
>> these jobs, we want the job id and the status.
>> List list = new ArrayList();
>> int maxCount = database.findConjunctionClauseMax(new
>> ClauseDescription[]{});
>> int currentCount = 0;
>> int i = 0;
>> while (i < connectionNames.length)
>> {
>>   if (currentCount == maxCount)
>>   {
>> noteTransformationConnectionRegistration(list);
>> list.clear();
>> currentCount = 0;
>>   }
>>
>>   list.add(connectionNames[i++]);
>>   currentCount++;
>> }
>> if (currentCount > 0)
>>   noteTransformationConnectionRegistration(list);
>>   }
>> <<<<<<
>>
>> It looks correct now.  Do you see an issue with it?
>>
>> Karl
>>
>>
>> On Mon, Jul 30, 2018 at 3:28 PM Mike Hugo  wrote:
>>
>>> Nice catch Karl!
>>>
>>> I applied that patch, but I'm still getting the same error.
>>>
>>> I think the problem is in JobManager.noteTransformationConnectionRegistration
>>>
>>> If jobs.findJobsMatchingTransformations(list); returns a large list of
>>> ids (like it is doing in our case - 39,941 ids ), the generated query
>>> string still has a large OR clause in it.  I don't see getMaxOrClause
>>> applied to the query being built inside noteTransformationConnectionRegistration
>>>
>>> >>>>>>
>>>  protected void noteTransformationConnectionRegistration(List
>>> list)
>>> throws ManifoldCFException
>>>   {
>>> // Query for the matching jobs, and then for each job potentially
>>> adjust the state
>>> Long[] jobIDs = jobs.findJobsMatchingTransformations(list);
>>> if (jobIDs.length == 0)
>>>   return;
>>>
>>> StringBuilder query = new StringBuilder();
>>> ArrayList newList = new ArrayList();
>>>
>>>     query.append("SELECT
>>> ").append(jobs.idField).append(",").append(jobs.statusField)
>>>   .append(" FROM ").append(jobs.getTableName()).append(" WHERE ")
>>> *  .append(database.buildConjunctionClause(newList,new
>>> ClauseDescription[]{*
>>> *new MultiClause(jobs.idField,jobIDs)}))*
>>>   .append(" FOR UPDATE");
>>> IResultSet set =
>>> database.performQuery(query.toString(),newList,null,null);
>>> 

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
The limit is applied in the method that calls
noteTransformationConnectionRegistration.

Here it is:

>>>>>>
  /** Note the registration of a transformation connector used by the
specified connections.
  * This method will be called when a connector is registered, on which the
specified
  * connections depend.
  *@param connectionNames is the set of connection names.
  */
  @Override
  public void noteTransformationConnectorRegistration(String[]
connectionNames)
throws ManifoldCFException
  {
// For each connection, find the corresponding list of jobs.  From
these jobs, we want the job id and the status.
List list = new ArrayList();
int maxCount = database.findConjunctionClauseMax(new
ClauseDescription[]{});
int currentCount = 0;
int i = 0;
while (i < connectionNames.length)
{
  if (currentCount == maxCount)
  {
noteTransformationConnectionRegistration(list);
list.clear();
currentCount = 0;
  }

  list.add(connectionNames[i++]);
  currentCount++;
}
if (currentCount > 0)
  noteTransformationConnectionRegistration(list);
  }
<<<<<<

It looks correct now.  Do you see an issue with it?

Karl


On Mon, Jul 30, 2018 at 3:28 PM Mike Hugo  wrote:

> Nice catch Karl!
>
> I applied that patch, but I'm still getting the same error.
>
> I think the problem is in JobManager.noteTransformationConnectionRegistration
>
> If jobs.findJobsMatchingTransformations(list); returns a large list of
> ids (like it is doing in our case - 39,941 ids ), the generated query
> string still has a large OR clause in it.  I don't see getMaxOrClause
> applied to the query being built inside noteTransformationConnectionRegistration
>
> >>>>>>
>  protected void noteTransformationConnectionRegistration(List list)
> throws ManifoldCFException
>   {
> // Query for the matching jobs, and then for each job potentially
> adjust the state
> Long[] jobIDs = jobs.findJobsMatchingTransformations(list);
> if (jobIDs.length == 0)
>   return;
>
> StringBuilder query = new StringBuilder();
> ArrayList newList = new ArrayList();
>
> query.append("SELECT
> ").append(jobs.idField).append(",").append(jobs.statusField)
>   .append(" FROM ").append(jobs.getTableName()).append(" WHERE ")
> *  .append(database.buildConjunctionClause(newList,new
> ClauseDescription[]{*
> *new MultiClause(jobs.idField,jobIDs)}))*
>   .append(" FOR UPDATE");
> IResultSet set =
> database.performQuery(query.toString(),newList,null,null);
> int i = 0;
> while (i < set.getRowCount())
> {
>   IResultRow row = set.getRow(i++);
>   Long jobID = (Long)row.getValue(jobs.idField);
>   int statusValue =
> jobs.stringToStatus((String)row.getValue(jobs.statusField));
>   jobs.noteTransformationConnectorRegistration(jobID,statusValue);
> }
>   }
> <<<<<<
>
>
> On Mon, Jul 30, 2018 at 1:55 PM, Karl Wright  wrote:
>
>> The Postgresql driver supposedly limits this to 25 clauses at a pop:
>>
>> >>>>>>
>>   @Override
>>   public int getMaxOrClause()
>>   {
>> return 25;
>>   }
>>
>>   /* Calculate the number of values a particular clause can have, given
>> the values for all the other clauses.
>>   * For example, if in the expression x AND y AND z, x has 2 values and z
>> has 1, find out how many values x can legally have
>>   * when using the buildConjunctionClause() method below.
>>   */
>>   @Override
>>   public int findConjunctionClauseMax(ClauseDescription[]
>> otherClauseDescriptions)
>>   {
>> // This implementation uses "OR"
>> return getMaxOrClause();
>>   }
>> <<<<<<
>>
>> The problem is that there was a cut-and-paste error, with just
>> transformation connections, that defeated the limit.  I'll create a ticket
>> and attach a patch.  CONNECTORS-1520.
>>
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Jul 30, 2018 at 2:29 PM Karl Wright  wrote:
>>
>>> Hi Mike,
>>>
>>> This might be the issue indeed.  I'll look into it.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jul 30, 2018 at 2:26 PM Mike Hugo  wrote:
>>>
>>>> I'm not sure what the solution is yet, but I think I may have found the
>>>> culprit:
>>>>
>>>> JobManager.noteTransformationConnectionRegistration(List list)
>>>> is creating a pretty big query:
>>>>
>>>> SELECT id,sta

[jira] [Resolved] (CONNECTORS-1520) Connector registration/deregistration fails when more than a certain number of jobs

2018-07-30 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1520.
-
Resolution: Fixed

r1837084

> Connector registration/deregistration fails when more than a certain number 
> of jobs
> ---
>
> Key: CONNECTORS-1520
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1520
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework agents process
>Affects Versions: ManifoldCF 2.10
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1520.patch
>
>
> Cut-and-paste error defeated limits on the number of jobs updated at one time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1520) Connector registration/deregistration fails when more than a certain number of jobs

2018-07-30 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1520:

Attachment: CONNECTORS-1520.patch

> Connector registration/deregistration fails when more than a certain number 
> of jobs
> ---
>
> Key: CONNECTORS-1520
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1520
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework agents process
>Affects Versions: ManifoldCF 2.10
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1520.patch
>
>
> Cut-and-paste error defeated limits on the number of jobs updated at one time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1520) Connector registration/deregistration fails when more than a certain number of jobs

2018-07-30 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1520:
---

 Summary: Connector registration/deregistration fails when more 
than a certain number of jobs
 Key: CONNECTORS-1520
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1520
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework agents process
Affects Versions: ManifoldCF 2.10
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.11


Cut-and-paste error defeated limits on the number of jobs updated at one time.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
The Postgresql driver supposedly limits this to 25 clauses at a pop:

>>>>>>
  @Override
  public int getMaxOrClause()
  {
    return 25;
  }

  /* Calculate the number of values a particular clause can have, given the
  * values for all the other clauses.
  * For example, if in the expression x AND y AND z, x has 2 values and z
  * has 1, find out how many values x can legally have
  * when using the buildConjunctionClause() method below.
  */
  @Override
  public int findConjunctionClauseMax(ClauseDescription[] otherClauseDescriptions)
  {
    // This implementation uses "OR"
    return getMaxOrClause();
  }
<<<<<<

The problem is that there was a cut-and-paste error, with just
transformation connections, that defeated the limit.  I'll create a ticket
and attach a patch.  CONNECTORS-1520.

Karl





On Mon, Jul 30, 2018 at 2:29 PM Karl Wright  wrote:

> Hi Mike,
>
> This might be the issue indeed.  I'll look into it.
>
> Karl
>
>
> On Mon, Jul 30, 2018 at 2:26 PM Mike Hugo  wrote:
>
>> I'm not sure what the solution is yet, but I think I may have found the
>> culprit:
>>
>> JobManager.noteTransformationConnectionRegistration(List list) is
>> creating a pretty big query:
>>
>> SELECT id,status FROM jobs WHERE  (id=? OR id=? OR id=? OR id=? 
>> OR id=?) FOR UPDATE
>>
>> replace the ellipsis with a list of 39,941 ids (it's a huge query when
>> it prints out)
>>
>> It seems that the database doesn't like that query and closes the
>> connection before returning with a response.
>>
>> As I mentioned this instance of manifold has nearly 40,000 web crawlers.
>> Is that a high number for Manifold to handle?
>>
>> On Mon, Jul 30, 2018 at 10:58 AM, Karl Wright  wrote:
>>
>>> Well, I have absolutely no idea what is wrong and I've never seen
>>> anything like that before.  But postgres is complaining because the
>>> communication with the JDBC client is being interrupted by something.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jul 30, 2018 at 10:39 AM Mike Hugo  wrote:
>>>
>>>> No, and manifold and postgres run on the same host.
>>>>
>>>> On Mon, Jul 30, 2018 at 9:35 AM, Karl Wright 
>>>> wrote:
>>>>
>>>>> ' LOG:  incomplete message from client'
>>>>>
>>>>> This shows a network issue.  Did your network configuration change
>>>>> recently?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo  wrote:
>>>>>
>>>>>> Tried a postgres vacuum and also a restart, but the problem
>>>>>> persists.  Here's the log again with some additional logging details 
>>>>>> added
>>>>>> (below)
>>>>>>
>>>>>> I tried running the last query from the logs against the database and
>>>>>> it works fine - I modified it to return a count and that also works.
>>>>>>
>>>>>> SELECT count(*) FROM jobs t1 WHERE EXISTS(SELECT 'x' FROM
>>>>>> jobpipelines WHERE t1.id=ownerid AND transformationname='Tika');
>>>>>>  count
>>>>>> ---
>>>>>>  39941
>>>>>> (1 row)
>>>>>>
>>>>>>
>>>>>> Is 39k jobs a high number?  I've run some other instances of Manifold
>>>>>> with more like 1,000 jobs and those seem to be working fine.  That's the
>>>>>> only thing I can think of that's different between this instance that 
>>>>>> won't
>>>>>> start and the others.  Any ideas?
>>>>>>
>>>>>> Thanks for your help!
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> LOG:  duration: 0.079 ms  parse : SELECT connectionname FROM
>>>>>> transformationconnections WHERE classname=$1
>>>>>> LOG:  duration: 0.079 ms  bind : SELECT connectionname FROM
>>>>>> transformationconnections WHERE classname=$1
>>>>>> DETAIL:  parameters: $1 =
>>>>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>>>>> LOG:  duration: 0.017 ms  execute : SELECT connectionname
>>>>>> FROM transformationconnections WHERE classname=$1
>>>>>> DETAIL:  parameters: $1 =
>>>>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>>>>> LOG

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Hi Mike,

This might be the issue indeed.  I'll look into it.

Karl


On Mon, Jul 30, 2018 at 2:26 PM Mike Hugo  wrote:

> I'm not sure what the solution is yet, but I think I may have found the
> culprit:
>
> JobManager.noteTransformationConnectionRegistration(List list) is
> creating a pretty big query:
>
> SELECT id,status FROM jobs WHERE  (id=? OR id=? OR id=? OR id=? 
> OR id=?) FOR UPDATE
>
> replace the ellipsis with a list of 39,941 ids (it's a huge query when it
> prints out)
>
> It seems that the database doesn't like that query and closes the
> connection before returning with a response.
>
> As I mentioned this instance of manifold has nearly 40,000 web crawlers.
> Is that a high number for Manifold to handle?
>
> On Mon, Jul 30, 2018 at 10:58 AM, Karl Wright  wrote:
>
>> Well, I have absolutely no idea what is wrong and I've never seen
>> anything like that before.  But postgres is complaining because the
>> communication with the JDBC client is being interrupted by something.
>>
>> Karl
>>
>>
>> On Mon, Jul 30, 2018 at 10:39 AM Mike Hugo  wrote:
>>
>>> No, and manifold and postgres run on the same host.
>>>
>>> On Mon, Jul 30, 2018 at 9:35 AM, Karl Wright  wrote:
>>>
>>>> ' LOG:  incomplete message from client'
>>>>
>>>> This shows a network issue.  Did your network configuration change
>>>> recently?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo  wrote:
>>>>
>>>>> Tried a postgres vacuum and also a restart, but the problem persists.
>>>>> Here's the log again with some additional logging details added (below)
>>>>>
>>>>> I tried running the last query from the logs against the database and
>>>>> it works fine - I modified it to return a count and that also works.
>>>>>
>>>>> SELECT count(*) FROM jobs t1 WHERE EXISTS(SELECT 'x' FROM jobpipelines
>>>>> WHERE t1.id=ownerid AND transformationname='Tika');
>>>>>  count
>>>>> ---
>>>>>  39941
>>>>> (1 row)
>>>>>
>>>>>
>>>>> Is 39k jobs a high number?  I've run some other instances of Manifold
>>>>> with more like 1,000 jobs and those seem to be working fine.  That's the
>>>>> only thing I can think of that's different between this instance that 
>>>>> won't
>>>>> start and the others.  Any ideas?
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>> Mike
>>>>>
>>>>> LOG:  duration: 0.079 ms  parse : SELECT connectionname FROM
>>>>> transformationconnections WHERE classname=$1
>>>>> LOG:  duration: 0.079 ms  bind : SELECT connectionname FROM
>>>>> transformationconnections WHERE classname=$1
>>>>> DETAIL:  parameters: $1 =
>>>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>>>> LOG:  duration: 0.017 ms  execute : SELECT connectionname
>>>>> FROM transformationconnections WHERE classname=$1
>>>>> DETAIL:  parameters: $1 =
>>>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>>>> LOG:  duration: 0.039 ms  parse : SELECT * FROM agents
>>>>> LOG:  duration: 0.040 ms  bind : SELECT * FROM agents
>>>>> LOG:  duration: 0.010 ms  execute : SELECT * FROM agents
>>>>> LOG:  duration: 0.084 ms  parse : SELECT id FROM jobs t1
>>>>> WHERE EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>>>> transformationname=$1)
>>>>> LOG:  duration: 0.359 ms  bind : SELECT id FROM jobs t1 WHERE
>>>>> EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>>>> transformationname=$1)
>>>>> DETAIL:  parameters: $1 = 'Tika'
>>>>> LOG:  duration: 77.622 ms  execute : SELECT id FROM jobs t1
>>>>> WHERE EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>>>> transformationname=$1)
>>>>> DETAIL:  parameters: $1 = 'Tika'
>>>>> LOG:  incomplete message from client
>>>>> LOG:  disconnection: session time: 0:00:06.574 user=REMOVED
>>>>> database=REMOVED host=127.0.0.1 port=45356
>>>>> >2018-07-30 12:36:09,415 [main] ERROR org.apache.manifoldcf.root -
>>>>> Exception: This connection has been closed.
>>>>> org.apa

Re: Scheduling Problem and the IBM Domino Connector

2018-07-30 Thread Karl Wright
I am not aware of any existing Domino connector.

Karl


On Mon, Jul 30, 2018 at 12:19 PM Cheng Zeng  wrote:

> Thank you very much for your reply. Your advice is very helpful.
>
> I am wondering if the MCF supports IBM Domino?
>
> Does anyone know if there are available libraries or API resource to
> extract documents from Domino server?
>
> Best wishes,
> Cheng
>
> On 30 Jul 2018, at 17:48, Karl Wright  wrote:
>
> Hi Cheng,
>
> Dynamic recrawl revisits documents based on the frequency that they
> changed in the past.   It is therefore hard to make any prediction about
> whether a document will be recrawled in a given time interval.  You need
> recrawls of existing directories in order to discover new documents in
> SharePoint.
>
> If you want more predictable crawling, I'd suggest doing standard minimal
> crawls on a fixed schedule.  That will pick up any new documents added.
> Then do full crawls (not dynamic) periodically (once a week?) to clean up
> any deleted documents.
>
> Thanks,
> Karl
>
>
> On Mon, Jul 30, 2018 at 4:35 AM Cheng Zeng  wrote:
>
>> Hi Karl,
>>
>>
>> I have a question about the schedule-related configuration in the job. I
>> have a continuously running job which crawls the documents in Sharepoint
>> 2013 and the job is supposed to re-crawl about 26,000 docs every 24
>> hours as configured, however, it seems that there is something wrong with
>> my configuration, as I found that the number of active documents is only
>> increased by 1 or 2  when there are about 20 new documents created in the
>> Sharepoint after the continuous job runs for over a few weeks. If I
>> restarted the job, there were more active documents found and the number of
>> active documents reflected the correct number of the documents in the
>> Sharepoint lists. It seems that the job is not re-scanning all the
>> documents. I suspect there is something wrong with my scheduling
>> configuration. Although I have read section about how to set up the
>> schedule-related configuration information at end-user-documentation at
>> https://manifoldcf.apache.org/release/release-2.10/en_US/end-user-documentation.html#jobs,
>> I am still confused by the incorrect number of active documents of the job
>> after the continuous job runs for a few weeks.  The version of mcf I am
>> using is 2.6.
>>
>>
>> My schedule configuration is as follows:
>>
>>
>> Schedule type: Rescan Documents Dynamically
>>
>> Recrawl interval (if continuous): 1440 minutes
>>
>> Maximum recrawl interval (if continuous): blank
>>
>> Expiration interval (if continuous): blank
>>
>> Reseed interval (if continuous): blank
>>
>>
>>
>> Scheduled time:
>>
>> schedule 1: Any day of week, 5am plus 0, every month of year on any day
>> of month Job invocation:complete
>>
>>  Maximum run time: 3000 minutes
>>
>>
>> schedule 2: Any day of week, 12pm plus 0, every month of year on any day
>> of month  Job invocation:complete
>>
>>  Maximum run time: 3000 minutes
>>
>>
>> The screenshot of the scheduling is attached.
>>
>>
>> Could you please give me some advice about the problem I am facing.
>>
>>
>> BTW: Does MCF support Domino? Are there any methods to extract documents
>> from Domino?
>>
>>
>>
>> Best wishes,
>>
>>
>> Cheng
>>
>>
>>
>>


Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Well, I have absolutely no idea what is wrong and I've never seen anything
like that before.  But postgres is complaining because the communication
with the JDBC client is being interrupted by something.

Karl


On Mon, Jul 30, 2018 at 10:39 AM Mike Hugo  wrote:

> No, and manifold and postgres run on the same host.
>
> On Mon, Jul 30, 2018 at 9:35 AM, Karl Wright  wrote:
>
>> ' LOG:  incomplete message from client'
>>
>> This shows a network issue.  Did your network configuration change
>> recently?
>>
>> Karl
>>
>>
>> On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo  wrote:
>>
>>> Tried a postgres vacuum and also a restart, but the problem persists.
>>> Here's the log again with some additional logging details added (below)
>>>
>>> I tried running the last query from the logs against the database and it
>>> works fine - I modified it to return a count and that also works.
>>>
>>> SELECT count(*) FROM jobs t1 WHERE EXISTS(SELECT 'x' FROM jobpipelines
>>> WHERE t1.id=ownerid AND transformationname='Tika');
>>>  count
>>> ---
>>>  39941
>>> (1 row)
>>>
>>>
>>> Is 39k jobs a high number?  I've run some other instances of Manifold
>>> with more like 1,000 jobs and those seem to be working fine.  That's the
>>> only thing I can think of that's different between this instance that won't
>>> start and the others.  Any ideas?
>>>
>>> Thanks for your help!
>>>
>>> Mike
>>>
>>> LOG:  duration: 0.079 ms  parse : SELECT connectionname FROM
>>> transformationconnections WHERE classname=$1
>>> LOG:  duration: 0.079 ms  bind : SELECT connectionname FROM
>>> transformationconnections WHERE classname=$1
>>> DETAIL:  parameters: $1 =
>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>> LOG:  duration: 0.017 ms  execute : SELECT connectionname FROM
>>> transformationconnections WHERE classname=$1
>>> DETAIL:  parameters: $1 =
>>> 'org.apache.manifoldcf.agents.transformation.tika.TikaExtractor'
>>> LOG:  duration: 0.039 ms  parse : SELECT * FROM agents
>>> LOG:  duration: 0.040 ms  bind : SELECT * FROM agents
>>> LOG:  duration: 0.010 ms  execute : SELECT * FROM agents
>>> LOG:  duration: 0.084 ms  parse : SELECT id FROM jobs t1 WHERE
>>> EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>> transformationname=$1)
>>> LOG:  duration: 0.359 ms  bind : SELECT id FROM jobs t1 WHERE
>>> EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>> transformationname=$1)
>>> DETAIL:  parameters: $1 = 'Tika'
>>> LOG:  duration: 77.622 ms  execute : SELECT id FROM jobs t1
>>> WHERE EXISTS(SELECT 'x' FROM jobpipelines WHERE t1.id=ownerid AND
>>> transformationname=$1)
>>> DETAIL:  parameters: $1 = 'Tika'
>>> LOG:  incomplete message from client
>>> LOG:  disconnection: session time: 0:00:06.574 user=REMOVED
>>> database=REMOVED host=127.0.0.1 port=45356
>>> >2018-07-30 12:36:09,415 [main] ERROR org.apache.manifoldcf.root -
>>> Exception: This connection has been closed.
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: This
>>> connection has been closed.
>>> at
>>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:627)
>>> ~[mcf-core.jar:?]
>>> at
>>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.rollbackCurrentTransaction(DBInterfacePostgreSQL.java:1296)
>>> ~[mcf-core.jar:?]
>>> at
>>> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:368)
>>> ~[mcf-core.jar:?]
>>> at
>>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1236)
>>> ~[mcf-core.jar:?]
>>> at
>>> org.apache.manifoldcf.crawler.system.ManifoldCF.registerConnectors(ManifoldCF.java:605)
>>> ~[mcf-pull-agent.jar:?]
>>> at
>>> org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160)
>>> ~[mcf-pull-agent.jar:?]
>>> at
>>> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239)
>>> [mcf-jetty-runner.jar:?]
>>> Caused by: org.postgresql.util.PSQLException: This connection has been
>>> closed.
>>> at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:766)
>>> ~[postgresql-42.1.3.jar:42.1.3]
>>> at
>>>

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
ors(ManifoldCF.java:605)
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160)
> at
> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239)
> Caused by: org.postgresql.util.PSQLException: This connection has been
> closed.
> at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:766)
> at org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:1576)
> at org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:367)
> at org.apache.manifoldcf.core.database.Database.execute(Database.java:873)
> at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696)
> LOG:  disconnection: session time: 0:00:10.677 user=postgres
> database=template1 host=127.0.0.1 port=45354
>
>
>
> On Sun, Jul 29, 2018 at 8:09 AM, Karl Wright  wrote:
>
>> It looks to me like your database server is not happy.  Maybe it's out of
>> resources?  Not sure but a restart may be in order.
>>
>> Karl
>>
>>
>> On Sun, Jul 29, 2018 at 9:06 AM Mike Hugo  wrote:
>>
>>> Recently we started seeing this error when Manifold CF starts up.  We
>>> had been running Manifold CF with many web connectors and a few RSS feeds
>>> for a while and it had been working fine.  The server got rebooted and
>>> since then we started seeing this error. I'm not sure exactly what
>>> changed.  Any ideas as to where to start looking and how to fix this?
>>>
>>> Thanks!
>>>
>>> Mike
>>>
>>>
>>> Initial repository connections already created.
>>> Configuration file successfully read
>>> Successfully unregistered all domains
>>> Successfully unregistered all output connectors
>>> Successfully unregistered all transformation connectors
>>> Successfully unregistered all mapping connectors
>>> Successfully unregistered all authority connectors
>>> Successfully unregistered all repository connectors
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.solr.SolrConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.searchblox.SearchBloxConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.opensearchserver.OpenSearchServerConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.nullconnector.NullConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.kafka.KafkaOutputConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.hdfs.HDFSOutputConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.gts.GTSConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.filesystem.FileOutputConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered output connector
>>> 'org.apache.manifoldcf.agents.output.amazoncloudsearch.AmazonCloudSearchConnector'
>>> WARNING:  there is already a transaction in progress
>>> WARNING:  there is no transaction in progress
>>> Successfully registered transformation connector
>>> 'org.apache.manifoldcf.agents.transformation.tikaservi

Re: Scheduling Problem

2018-07-30 Thread Karl Wright
Hi Cheng,

Dynamic recrawl revisits documents based on the frequency that they changed
in the past.   It is therefore hard to make any prediction about whether a
document will be recrawled in a given time interval.  You need recrawls of
existing directories in order to discover new documents in SharePoint.

If you want more predictable crawling, I'd suggest doing standard minimal
crawls on a fixed schedule.  That will pick up any new documents added.
Then do full crawls (not dynamic) periodically (once a week?) to clean up
any deleted documents.

Thanks,
Karl


On Mon, Jul 30, 2018 at 4:35 AM Cheng Zeng  wrote:

> Hi Karl,
>
>
> I have a question about the schedule-related configuration in the job. I
> have a continuously running job which crawls the documents in Sharepoint
> 2013 and the job is supposed to re-crawl about 26,000 docs every 24 hours
> as configured, however, it seems that there is something wrong with my
> configuration, as I found that the number of active documents is only
> increased by 1 or 2  when there are about 20 new documents created in the
> Sharepoint after the continuous job runs for over a few weeks. If I
> restarted the job, there were more active documents found and the number of
> active documents reflected the correct number of the documents in the
> Sharepoint lists. It seems that the job is not re-scanning all the
> documents. I suspect there is something wrong with my scheduling
> configuration. Although I have read section about how to set up the
> schedule-related configuration information at end-user-documentation at
> https://manifoldcf.apache.org/release/release-2.10/en_US/end-user-documentation.html#jobs,
> I am still confused by the incorrect number of active documents of the job
> after the continuous job runs for a few weeks.  The version of mcf I am
> using is 2.6.
>
>
> My schedule configuration is as follows:
>
>
> Schedule type: Rescan Documents Dynamically
>
> Recrawl interval (if continuous): 1440 minutes
>
> Maximum recrawl interval (if continuous): blank
>
> Expiration interval (if continuous): blank
>
> Reseed interval (if continuous): blank
>
>
>
> Scheduled time:
>
> schedule 1: Any day of week, 5am plus 0, every month of year on any day of
> month Job invocation:complete
>
>  Maximum run time: 3000 minutes
>
>
> schedule 2: Any day of week, 12pm plus 0, every month of year on any day
> of month  Job invocation:complete
>
>  Maximum run time: 3000 minutes
>
>
> The screenshot of the scheduling is attached.
>
>
> Could you please give me some advice about the problem I am facing.
>
>
> BTW: Does MCF support Domino? Are there any methods to extract documents
> from Domino?
>
>
>
> Best wishes,
>
>
> Cheng
>
>
>
>


Re: PSQLException: This connection has been closed.

2018-07-29 Thread Karl Wright
It looks to me like your database server is not happy.  Maybe it's out of
resources?  Not sure but a restart may be in order.

Karl


On Sun, Jul 29, 2018 at 9:06 AM Mike Hugo  wrote:

> Recently we started seeing this error when Manifold CF starts up.  We had
> been running Manifold CF with many web connectors and a few RSS feeds for a
> while and it had been working fine.  The server got rebooted and since then
> we started seeing this error. I'm not sure exactly what changed.  Any ideas
> as to where to start looking and how to fix this?
>
> Thanks!
>
> Mike
>
>
> Initial repository connections already created.
> Configuration file successfully read
> Successfully unregistered all domains
> Successfully unregistered all output connectors
> Successfully unregistered all transformation connectors
> Successfully unregistered all mapping connectors
> Successfully unregistered all authority connectors
> Successfully unregistered all repository connectors
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.solr.SolrConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.searchblox.SearchBloxConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.opensearchserver.OpenSearchServerConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.nullconnector.NullConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.kafka.KafkaOutputConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.hdfs.HDFSOutputConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.gts.GTSConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.filesystem.FileOutputConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered output connector
> 'org.apache.manifoldcf.agents.output.amazoncloudsearch.AmazonCloudSearchConnector'
> WARNING:  there is already a transaction in progress
> WARNING:  there is no transaction in progress
> Successfully registered transformation connector
> 'org.apache.manifoldcf.agents.transformation.tikaservice.TikaExtractor'
> WARNING:  there is already a transaction in progress
> LOG:  incomplete message from client
> >2018-07-29 13:02:06,659 [main] ERROR org.apache.manifoldcf.root -
> Exception: This connection has been closed.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: This connection
> has been closed.
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:627)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.rollbackCurrentTransaction(DBInterfacePostgreSQL.java:1296)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:368)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1236)
> ~[mcf-core.jar:?]
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.registerConnectors(ManifoldCF.java:605)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160)
> ~[mcf-pull-agent.jar:?]
> at
> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239)
> [mcf-jetty-runner.jar:?]
> Caused by: org.postgresql.util.PSQLException: This connection has been
> closed.
> at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:766)
> ~[postgresql-42.1.3.jar:42.1.3]
> at
> org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:1576)
> ~[postgresql-42.1.3.jar:42.1.3]
> at org.postgresql.jdbc.PgConnection.createStatement(PgConnection.java:367)
> ~[postgresql-42.1.3.jar:42.1.3]
> at 

[jira] [Commented] (CONNECTORS-1519) CLIENTPROTOCOLEXCEPTION is thrown with 2.10 -> ES 6.x.y

2018-07-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560364#comment-16560364
 ] 

Karl Wright commented on CONNECTORS-1519:
-

Can you have a look at what has changed?

ElasticSearch is a nightmare to support because its protocols and APIs change 
dramatically on every release.  I'm happy to have multiple connectors available 
for supporting it but I simply don't have time myself to develop them.  Any 
help you can provide is therefore welcome.



> CLIENTPROTOCOLEXCEPTION   is thrown with 2.10 -> ES 6.x.y
> ---
>
> Key: CONNECTORS-1519
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1519
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> Investigating CLIENTPROTOCOLEXCEPTION when using 2.10 with ES 6.x.y
> More information to follow.
>  
>  
> |07-27-2018 17:53:19.010|Indexation 
> (ES)|file:/var/manifoldcf/corpus/14.html|CLIENTPROTOCOLEXCEPTION|38053|23|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Exclude files ~$*

2018-07-27 Thread Karl Wright
Can you view the job and include a screen shot of where this is displayed?
Thanks.

The exclusions are not regexps -- they are file specs.  The file specs have
special meanings for "*" (matches everything) and "?" (matches one
character).  You do not need to URL encode them.
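A minimal sketch of what such a file spec means, assuming a simple recursive
matcher (illustrative only -- this is not the actual JCIFS connector matching
code):

>>>>>>
  // Sketch: '*' matches any run of characters, '?' matches exactly one.
  public static boolean matchesSpec(String spec, String fileName)
  {
    if (spec.isEmpty())
      return fileName.isEmpty();
    char c = spec.charAt(0);
    if (c == '*')
    {
      // '*' may absorb zero or more characters of the file name
      for (int i = 0; i <= fileName.length(); i++)
      {
        if (matchesSpec(spec.substring(1), fileName.substring(i)))
          return true;
      }
      return false;
    }
    if (fileName.isEmpty())
      return false;
    if (c == '?' || c == fileName.charAt(0))
      return matchesSpec(spec.substring(1), fileName.substring(1));
    return false;
  }
<<<<<<

Under that reading, an exclusion spec of ~$* matches names like "~$Budget.xlsx"
directly, with no URL encoding needed.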

If you enable connector debugging () you will see
statements like these, which should hint as to what went wrong:

Logging.connectors.debug("JCIFS: Checking '"+match+"'
against '"+fileName.substring(matchEnd-1)+"'");


Karl


On Fri, Jul 27, 2018 at 8:36 AM msaunier  wrote:

> Hi Karl,
>
>
>
> In my JCIFS connector, I want to configure an exclude condition if file
> names start with ~$*
>
> I have added the condition, but it is not working.
>
> Do I need to add %7E%24* or a regex?
>
>
>
> Thanks,
>
>
>
> Maxence,
>


Re: Tika/POI bugs

2018-07-27 Thread Karl Wright
To solve your production problem I highly recommend limiting the size of
the docs fed to Tika, for a start.  But that is no guarantee, I understand.

Out of memory problems are very hard to get good forensics for because they
cause major disruptions to the running server.  You could turn on a degree
of logging so that you can see what documents are being processed at any
time by all threads, but that is pretty verbose.  In your properties.xml
file, add .  But I suspect that will generate far too much noise.
Still, it's the best I can offer.

Karl
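The exact property got lost in the message above; a hedged sketch of what such a
properties.xml entry might look like (the property name below is an assumption --
verify it against the ManifoldCF logging documentation before relying on it):

>>>>>>
<!-- Assumed example: raise connector logging to DEBUG in properties.xml -->
<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
<<<<<<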


On Fri, Jul 27, 2018 at 7:52 AM msaunier  wrote:

> Hi Karl,
>
>
>
> Okay. For the Out of Memory:
>
>
>
> This is the last day that I can go on to find out where the error comes
> from. After that, I should go into production to meet my deadlines.
>
> I hope to find time in the future to be able to fix this problem on this
> server, otherwise I could not index it. Unfortunately, it is very difficult
> to find the documents that cause this error. I did not find any trace in
> the database. Even in debug mode, it is difficult to find the problematic
> document. Maybe if I limit to 1 thread I could find it more easily, but I'm
> afraid the crawl is very long.
>
> Maybe you have an idea of the best method to adopt to find this / these
> documents?
>
>
>
> Maxence
>
>
>
> *De :* Karl Wright [mailto:daddy...@gmail.com]
> *Envoyé :* vendredi 27 juillet 2018 12:47
> *À :* dev ; user@manifoldcf.apache.org
> *Objet :* Tika/POI bugs
>
>
>
> Hi all,
>
>
>
> I've easily spent 40 hours over the last two weeks chasing down bugs in
> Apache Tika and POI.  The two kinds I see are "ClassNotFound" (due to usage
> of the wrong ClassLoader), and "OutOfMemoryError" (not clear what it is due
> to yet).
>
> I don't have enough time to create tickets directly in Tika for all
> possible documents where these failures occur, so I urge our users to
> create tickets DIRECTLY in the Tika project in Jira.  I guess you can let
> the Tika people create the POI tickets, if need be.  For OutOfMemory
> problems, please attach the file that causes the problem to the ticket, and
> also the amount of memory you gave the agents process.  For ClassNotFound
> problems, also include the stack trace.
>
>
>
> Thanks in advance,
>
> Karlx
>


Tika/POI bugs

2018-07-27 Thread Karl Wright
Hi all,

I've easily spent 40 hours over the last two weeks chasing down bugs in
Apache Tika and POI.  The two kinds I see are "ClassNotFound" (due to usage
of the wrong ClassLoader), and "OutOfMemoryError" (not clear what it is due
to yet).

I don't have enough time to create tickets directly in Tika for all
possible documents where these failures occur, so I urge our users to
create tickets DIRECTLY in the Tika project in Jira.  I guess you can let
the Tika people create the POI tickets, if need be.  For OutOfMemory
problems, please attach the file that causes the problem to the ticket, and
also the amount of memory you gave the agents process.  For ClassNotFound
problems, also include the stack trace.

Thanks in advance,
Karl


Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
I am afraid you will need to open a Tika ticket, and be prepared to attach
your file to it.

Thanks,

Karl


On Fri, Jul 27, 2018 at 6:04 AM Bisonti Mario 
wrote:

> It isn’t a memory problem, because bigger xls files (30MB) have been
> processed.
>
>
>
> This xlsm file, with many colors etc., hangs.
>
> I could suppose that it is a tika/solr error but I don’t know how to solve
> it
>
> ☹
>
>
>
> *Oggetto:* R: Job stuck internal http error 500
>
>
>
> Yes, I am using:
> /opt/manifoldcf/multiprocess-file-example-proprietary
> I set:
>
> sudo nano options.env.unix
>
> -Xms2048m
>
> -Xmx2048m
>
>
>
> But I obtain the same error.
>
> My doubt is that it could be a solr/tika problem.
>
> What could I do?
>
> I restrict the scan to a single file and I obtain the same error
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* venerdì 27 luglio 2018 11:36
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck internal http error 500
>
>
>
> I am presuming you are using the examples.  If so, edit the options file
> to grant more memory to your agents process by increasing the Xmx value.
>
>
>
> Karl
>
>
>
> On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
> My job is stucking indexing an xlsx file of 38MB
>
>
>
> What could I do to solve my problem?
>
>
>
> In the following there is the error:
> 2018-07-27 08:55:15.562 WARN  (qtp1521083627-52) [   x:core_share]
> o.e.j.s.HttpChannel /solr/core_share/update/extract
>
> java.lang.OutOfMemoryError
>
> at
> java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:188)
>
> at
> java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:180)
>
> at
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:147)
>
> at
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:660)
>
> at java.base/java.lang.StringBuilder.append(StringBuilder.java:195)
>
> at
> org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:302)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>
> at
> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>
> at
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>
> at
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLTikaBodyPartHandler.run(OOXMLTikaBodyPartHandler.java:147)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.handleEndOfRun(OOXMLWordAndPowerPointTextHandler.java:468)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.endElement(OOXMLWordAndPowerPointTextHandler.java:450)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1714)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2879)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
>
&g

Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
It is not clear, though, which process you are talking about.  If it is Solr,
ask the Solr folks.

Karl

On Fri, Jul 27, 2018, 5:36 AM Karl Wright  wrote:

> I am presuming you are using the examples.  If so, edit the options file
> to grant more memory to your agents process by increasing the Xmx value.
>
> Karl
>
> On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario 
> wrote:
>
>> Hallo.
>>
>> My job is stucking indexing an xlsx file of 38MB
>>
>>
>>
>> What could I do to solve my problem?
>>
>>
>>
>> In the following there is the error:
>> 2018-07-27 08:55:15.562 WARN  (qtp1521083627-52) [   x:core_share]
>> o.e.j.s.HttpChannel /solr/core_share/update/extract
>>
>> java.lang.OutOfMemoryError
>>
>> at
>> java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:188)
>>
>> at
>> java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:180)
>>
>> at
>> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:147)
>>
>> at
>> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:660)
>>
>> at
>> java.base/java.lang.StringBuilder.append(StringBuilder.java:195)
>>
>> at
>> org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:302)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>
>> at
>> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>
>> at
>> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>>
>> at
>> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>>
>> at
>> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>>
>> at
>> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>>
>> at
>> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>>
>> at
>> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>>
>> at
>> org.apache.tika.parser.microsoft.ooxml.OOXMLTikaBodyPartHandler.run(OOXMLTikaBodyPartHandler.java:147)
>>
>> at
>> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.handleEndOfRun(OOXMLWordAndPowerPointTextHandler.java:468)
>>
>> at
>> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.endElement(OOXMLWordAndPowerPointTextHandler.java:450)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>>
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1714)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2879)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:532)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
>>
>> at
>> java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
>>
>> at
>> java.x

Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
I am presuming you are using the examples.  If so, edit the options file to
grant more memory to your agents process by increasing the Xmx value.

Karl
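For the multiprocess file example that means raising the -Xmx line in
options.env.unix; the figures below are illustrative only, not a recommendation:

>>>>>>
-Xms1024m
-Xmx4096m
<<<<<<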

On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario  wrote:

> Hallo.
>
> My job is stucking indexing an xlsx file of 38MB
>
>
>
> What could I do to solve my problem?
>
>
>
> In the following there is the error:
> 2018-07-27 08:55:15.562 WARN  (qtp1521083627-52) [   x:core_share]
> o.e.j.s.HttpChannel /solr/core_share/update/extract
>
> java.lang.OutOfMemoryError
>
> at
> java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:188)
>
> at
> java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:180)
>
> at
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:147)
>
> at
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:660)
>
> at java.base/java.lang.StringBuilder.append(StringBuilder.java:195)
>
> at
> org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:302)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>
> at
> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>
> at
> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>
> at
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>
> at
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>
> at
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLTikaBodyPartHandler.run(OOXMLTikaBodyPartHandler.java:147)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.handleEndOfRun(OOXMLWordAndPowerPointTextHandler.java:468)
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.endElement(OOXMLWordAndPowerPointTextHandler.java:450)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1714)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2879)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:532)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
>
> at
> java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:324)
>
> at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:197)
>
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleGeneralTextContainingPart(AbstractOOXMLExtractor.java:506)
>
> at
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processShapes(XSSFExcelExtractorDecorator.java:279)
>
> at
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:185)
>
> at
> 

[jira] [Commented] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559269#comment-16559269
 ] 

Karl Wright commented on CONNECTORS-1518:
-

[~svanschalkwyk], we don't control how much memory Tika takes to do its content 
extraction.  All we can guarantee is that we feed the content to Tika in 
streamed form.  In some cases it will use more memory and may need to load the 
entire document into memory.

The amount of memory you should give MCF when Tika is involved is therefore a 
function of your largest document (hopefully controlled by Allowed Documents 
filtering) times the number of worker threads you have allocated, plus some 
constant amount for overhead.
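As a purely illustrative calculation with assumed figures (not numbers taken from 
this report): 30 worker threads times a 40 MB largest allowed document is already 
on the order of 30 x 40 MB = 1.2 GB of heap for extraction alone, before the 
constant overhead is added.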

You can perhaps prove this to yourself better by setting up a Tika service and 
using the Tika external transformer instead.


> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1518.patch
>
>
>   ```Jul 26, 2018 1:21:51 PM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  agents process ran out of memory - shutting down
>  java.lang.OutOfMemoryError: Java heap space
>  at java.base/java.util.Arrays.copyOf(Arrays.java:3816)
>  at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>  at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>  at java.base/java.util.BitSet.set(BitSet.java:448)
>  at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>  at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
&g

[jira] [Resolved] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1518.
-
Resolution: Fixed

r1836769

> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1518.patch
>
>
>  Jul 26, 2018 1:21:51 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  agents process ran out of memory - shutting down
>  java.lang.OutOfMemoryError: Java heap space
>  at java.base/java.util.Arrays.copyOf(Arrays.java:3816)
>  at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>  at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>  at java.base/java.util.BitSet.set(BitSet.java:448)
>  at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>  at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>  at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>  at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>  at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandl

[jira] [Updated] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1518:

Attachment: CONNECTORS-1518.patch

> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: CONNECTORS-1518.patch
>
>
>  Jul 26, 2018 1:21:51 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  agents process ran out of memory - shutting down
>  java.lang.OutOfMemoryError: Java heap space
>  at java.base/java.util.Arrays.copyOf(Arrays.java:3816)
>  at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>  at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>  at java.base/java.util.BitSet.set(BitSet.java:448)
>  at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>  at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>  at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>  at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>  at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandl

[jira] [Commented] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559082#comment-16559082
 ] 

Karl Wright commented on CONNECTORS-1518:
-

Hi [~svanschalkwyk], the memory usage for the ElasticSearch connector seemingly 
depends on whether the mapper attachment is used or not.  Here's the code.  
Note that the content is properly streamed when the mapper attachment is used, 
but is dumped into an in-memory string buffer when it is not:

{code}
if (useMapperAttachments && inputStream != null) {
  if(needComma){
pw.print(",");
  }
  // I'm told this is not necessary: see CONNECTORS-690
  //pw.print("\"type\" : \"attachment\",");
  pw.print("\"file\" : {");
  String contentType = document.getMimeType();
  if (contentType != null)
pw.print("\"_content_type\" : "+jsonStringEscape(contentType)+",");
  String fileName = document.getFileName();
  if (fileName != null)
pw.print("\"_name\" : "+jsonStringEscape(fileName)+",");
  // Since ES 1.0
  pw.print(" \"_content\" : \"");
  Base64 base64 = new Base64();
  base64.encodeStream(inputStream, pw);
  pw.print("\"}");
}

if (!useMapperAttachments && inputStream != null) {
  if (contentAttributeName != null)
  {
Reader r = new InputStreamReader(inputStream, Consts.UTF_8);
StringBuilder sb = new 
StringBuilder((int)document.getBinaryLength());
char[] buffer = new char[65536];
while (true)
{
  int amt = r.read(buffer,0,buffer.length);
  if (amt == -1)
break;
  sb.append(buffer,0,amt);
}
needComma = writeField(pw, needComma, contentAttributeName, new 
String[]{sb.toString()});
  }
}
{code}

The second clause therefore needs to be reworked to stream the content properly 
rather than buffering it all in a StringBuilder first.
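
For reference, here is a rough sketch of the kind of streaming rewrite I have in 
mind.  The {{writeFieldStart}} and {{jsonEscapeAndWrite}} helpers do not exist yet 
and would need to be written; the point is just that escaping and writing happen 
chunk by chunk instead of through one big in-memory buffer:

{code}
if (!useMapperAttachments && inputStream != null) {
  if (contentAttributeName != null)
  {
    // Emit the field name and opening quote, then stream the content through
    // a chunk-wise JSON escaper instead of accumulating it in a StringBuilder.
    needComma = writeFieldStart(pw, needComma, contentAttributeName); // hypothetical helper
    pw.print("\"");
    Reader r = new InputStreamReader(inputStream, Consts.UTF_8);
    char[] buffer = new char[65536];
    while (true)
    {
      int amt = r.read(buffer,0,buffer.length);
      if (amt == -1)
        break;
      // hypothetical helper: JSON-escapes buffer[0..amt) and writes it to pw
      jsonEscapeAndWrite(pw, buffer, 0, amt);
    }
    pw.print("\"");
  }
}
{code}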

So is it correct to assume you're not using the mapper attachment?


> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>

[jira] [Updated] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1518:

Fix Version/s: ManifoldCF 2.11

> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
>  Jul 26, 2018 1:21:51 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  agents process ran out of memory - shutting down
>  java.lang.OutOfMemoryError: Java heap space
>  at java.base/java.util.Arrays.copyOf(Arrays.java:3816)
>  at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>  at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>  at java.base/java.util.BitSet.set(BitSet.java:448)
>  at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>  at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>  at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>  at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>  at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>  at org.apache.tika.sax.XHTMLContentHandler.characters(XHT

[jira] [Assigned] (CONNECTORS-1518) MCF shutting down when Tika is used

2018-07-26 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1518:
---

Assignee: Karl Wright

> MCF shutting down when Tika is used
> ---
>
> Key: CONNECTORS-1518
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1518
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
> Environment: Centos 7
> Prior to crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 1.8G 12G 98M 1.1G 13G
> Swap: 2.0G 0B 2.0G
> After crash:
> $free -h
>  total used free shared buff/cache available
> Mem: 15G 10G 4.0G 98M 1.1G 4.4G
> Swap: 2.0G 0B 2.0G
>  
> {{start-options.env.unix :}}
> {{-Xss500m}}
> {{-Xms1g}}
> {{-Xmx8g}}
> {{-Dorg.apache.manifoldcf.configfile=./properties.xml}}
> {{-Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token}}
> {{-cp}}
> {{.:./lib/mcf-core.jar:./lib/mcf-agents.jar:./lib/mcf-pull-agent.jar:./lib/mcf-ui-core.jar:./lib/mcf-jetty-runner.jar:./lib/jetty-continuation-9.2.3.v20140905.jar:./lib/jetty-http-9.2.3.v20140905.jar:./lib/jetty-io-9.2.3.v20140905.jar:./lib/jetty-jndi-9.2.3.v20140905.jar:./lib/jetty-jsp-jdt-2.3.3.jar:./lib/jetty-plus-9.2.3.v20140905.jar:./lib/jetty-schemas-3.1.M0.jar:./lib/jetty-security-9.2.3.v20140905.jar:./lib/jetty-server-9.2.3.v20140905.jar:./lib/jetty-servlet-9.2.3.v20140905.jar:./lib/jetty-util-9.2.3.v20140905.jar:./lib/jetty-webapp-9.2.3.v20140905.jar:./lib/jetty-xml-9.2.3.v20140905.jar:./lib/hsqldb-2.3.2.jar:./lib/postgresql-42.1.3.jar:./lib/commons-codec-1.10.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-collections4-4.1.jar:./lib/commons-discovery-0.5.jar:./lib/commons-el-1.0.jar:./lib/commons-exec-1.3.jar:./lib/commons-fileupload-1.2.2.jar:./lib/commons-io-2.5.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.6.jar:./lib/commons-logging-1.2.jar:./lib/ecj-4.3.1.jar:./lib/gson-2.8.0.jar:./lib/guava-21.0.jar:./lib/httpclient-4.5.3.jar:./lib/httpcore-4.4.6.jar:./lib/jasper-6.0.35.jar:./lib/jasper-el-6.0.35.jar:./lib/javax.servlet-api-3.1.0.jar:./lib/jna-4.1.0.jar:./lib/jna-platform-4.1.0.jar:./lib/json-simple-1.1.1.jar:./lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:./lib/juli-6.0.35.jar:./lib/log4j-1.2-api-2.4.1.jar:./lib/log4j-api-2.4.1.jar:./lib/log4j-core-2.4.1.jar:./lib/mail-1.4.5.jar:./lib/serializer-2.7.1.jar:./lib/slf4j-api-1.7.24.jar:./lib/slf4j-simple-1.7.24.jar:./lib/velocity-1.7.jar:./lib/xalan-2.7.1.jar:./lib/xercesImpl-2.10.0.jar:./lib/xml-apis-1.4.01.jar:./lib/zookeeper-3.4.10.jar:}}
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
>
>  Jul 26, 2018 1:21:51 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  agents process ran out of memory - shutting down
>  java.lang.OutOfMemoryError: Java heap space
>  at java.base/java.util.Arrays.copyOf(Arrays.java:3816)
>  at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>  at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>  at java.base/java.util.BitSet.set(BitSet.java:448)
>  at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>  at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>  at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>  at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>  at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>  at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>  at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLCont

[jira] [Commented] (CONNECTORS-1191) ManifoldCFException: Unexpected job status encountered

2018-07-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559073#comment-16559073
 ] 

Karl Wright commented on CONNECTORS-1191:
-

Hi [~svanschalkwyk], is there any reason you commented in this ticket? Your 
issue seems wholly unrelated.


> ManifoldCFException: Unexpected job status encountered
> --
>
> Key: CONNECTORS-1191
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1191
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.0.2
> Environment: - Debian 7.8 x86_64 GNU/Linux
> - Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
> - Server version: 5.5.41-0+wheezy1 (Debian)
>Reporter: Arcadius Ahouansou
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 1.9, ManifoldCF 2.1
>
> Attachments: 1433374857580-jobs.png, 1433374857580-schedule.png, 
> CONNECTORS-1191-2.patch, CONNECTORS-1191.patch, manifoldcf2.0.2.log, 
> stuffer-thread-manifoldcf.log, unexpected-jobqueue.png
>
>
> Hello.
> I am running the latest ManifoldCF 2.0.2 and my log is filled of 
> {code}
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job 
> status encountered: 34
>   at 
> org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:2073)
>   at 
> org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:8261)
>   at 
> org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)
> {code}
> I have attached full log for more detail.
> Note that I am running against MySQL.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
The way it works in the JCIFS connector is that files that aren't within
the specification are removed from the list of files being processed.  If a
file is already being processed, however, it is just retried.  So changing
this property to make an out-of-memory condition go away is not going to
work if you've already got a problem document being processed.
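
The check itself happens inside the connector's processDocuments(): before the
content is fetched, the connector asks the framework whether a document of that
length is wanted at all.  Roughly like this (a sketch, not the literal connector
source):

// inside the per-document loop, before opening the file's content stream:
if (!activities.checkLengthIndexable(fileLength))
{
  // excluded by the length criteria, so drop it from further processing
  activities.noDocument(documentIdentifier, versionString);
  continue;
}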

You can restart the job, and that will make it work.  Or you can add the
transformation connection instead.

FWIW, you could verify whether this was working properly if your simple history
was enabled.  Without that, you really can't.

Karl


On Thu, Jul 26, 2018 at 11:09 AM msaunier  wrote:

> On repository connection. I have add « 20971520 » on the max document size.
>
>
>
> Maxence
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Thursday, July 26, 2018 17:07
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> How are you limiting content size?  Is this in the repository connection,
> or in an Allowed Documents transformation connection?
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jul 26, 2018 at 10:58 AM msaunier  wrote:
>
> I have limit to 20Mb / document and I have again an out of memory java.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Thursday, July 26, 2018 16:23
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> I believe there's also a content length tab in the Windows Share
> connector, if you're using that.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jul 26, 2018 at 10:19 AM Karl Wright  wrote:
>
> The ContentLimiter truncates documents.  That's not what you want.
>
>
>
> Use the Allowed Documents transformer.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jul 26, 2018 at 10:06 AM msaunier  wrote:
>
> I have add a Content limiter transformation before Tika extractor. It’s
> very very slow now. It’s normal?
>
>
>
> Maxence,
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 25, 2018 19:15
> *To:* user@manifoldcf.apache.org
> *Subject:* ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> It looks like you are still running out of memory.  I would love to know
> what document it was that doing that.  I suspect it is very large already,
> and for some reason it cannot be streamed.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:13 PM Karl Wright  wrote:
>
> Hi Maxence,
>
>
>
> The second exception is occurring because processing is still occurring
> while the JVM is shutting down; it can be ignored.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>
> Hi Karl,
>
>
>
> I have add the snapshot and I’m spam with this error :
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apa

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
How are you limiting content size?  Is this in the repository connection,
or in an Allowed Documents transformation connection?

Karl


On Thu, Jul 26, 2018 at 10:58 AM msaunier  wrote:

> I have limit to 20Mb / document and I have again an out of memory java.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Thursday, July 26, 2018 16:23
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> I believe there's also a content length tab in the Windows Share
> connector, if you're using that.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jul 26, 2018 at 10:19 AM Karl Wright  wrote:
>
> The ContentLimiter truncates documents.  That's not what you want.
>
>
>
> Use the Allowed Documents transformer.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jul 26, 2018 at 10:06 AM msaunier  wrote:
>
> I have add a Content limiter transformation before Tika extractor. It’s
> very very slow now. It’s normal?
>
>
>
> Maxence,
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 25, 2018 19:15
> *To:* user@manifoldcf.apache.org
> *Subject:* ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> It looks like you are still running out of memory.  I would love to know
> what document it was that doing that.  I suspect it is very large already,
> and for some reason it cannot be streamed.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:13 PM Karl Wright  wrote:
>
> Hi Maxence,
>
>
>
> The second exception is occurring because processing is still occurring
> while the JVM is shutting down; it can be ignored.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>
> Hi Karl,
>
>
>
> I have add the snapshot and I’m spam with this error :
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
> a

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
I believe there's also a content length tab in the Windows Share connector,
if you're using that.

Karl


On Thu, Jul 26, 2018 at 10:19 AM Karl Wright  wrote:

> The ContentLimiter truncates documents.  That's not what you want.
>
> Use the Allowed Documents transformer.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 10:06 AM msaunier  wrote:
>
>> I have add a Content limiter transformation before Tika extractor. It’s
>> very very slow now. It’s normal?
>>
>>
>>
>> Maxence,
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>> *Sent:* Wednesday, July 25, 2018 19:15
>> *To:* user@manifoldcf.apache.org
>> *Subject:* ***UNCHECKED*** Re: Out of memory, one file bug i think
>>
>>
>>
>> It looks like you are still running out of memory.  I would love to know
>> what document it was that doing that.  I suspect it is very large already,
>> and for some reason it cannot be streamed.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jul 25, 2018 at 1:13 PM Karl Wright  wrote:
>>
>> Hi Maxence,
>>
>>
>>
>> The second exception is occurring because processing is still occurring
>> while the JVM is shutting down; it can be ignored.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>>
>> Hi Karl,
>>
>>
>>
>> I have add the snapshot and I’m spam with this error :
>>
>>
>>
>> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
>> org/apache/commons/compress/utils/InputStreamStatistics
>>
>> java.lang.NoClassDefFoundError:
>> org/apache/commons/compress/utils/InputStreamStatistics
>>
>> at
>> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
>> ~[?:?]
>>
>> at
>> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
>> ~[mcf-pull-agent.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.inge

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
The ContentLimiter truncates documents.  That's not what you want.

Use the Allowed Documents transformer.

Karl


On Thu, Jul 26, 2018 at 10:06 AM msaunier  wrote:

> I have add a Content limiter transformation before Tika extractor. It’s
> very very slow now. It’s normal?
>
>
>
> Maxence,
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 25, 2018 19:15
> *To:* user@manifoldcf.apache.org
> *Subject:* ***UNCHECKED*** Re: Out of memory, one file bug i think
>
>
>
> It looks like you are still running out of memory.  I would love to know
> what document it was that doing that.  I suspect it is very large already,
> and for some reason it cannot be streamed.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:13 PM Karl Wright  wrote:
>
> Hi Maxence,
>
>
>
> The second exception is occurring because processing is still occurring
> while the JVM is shutting down; it can be ignored.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>
> Hi Karl,
>
>
>
> I have add the snapshot and I’m spam with this error :
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> Maxence,
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 25, 2018 13:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Out of memory, one file bug i think
>
>
>
> Hi Maxence,
>
>
>
> Tomorrow (7/26) the POI project will be delivering a nightly build which
> should rep

Re: Solr connection, max connections and CPU

2018-07-26 Thread Karl Wright
Hi Mario,

There is no connection between the number of CPUs and the number of output
connections.  You pick the maximum number of output connections based on
the number of listening threads that you can use at the same time in Solr.

Karl

On Thu, Jul 26, 2018 at 9:22 AM Bisonti Mario 
wrote:

> Hallo, I setup solr connection in the “Output connections” of Manifold
>
>
>
> I don’t understand if there is a relation between “Max Connections” and
> the number of CPUs in the host.
>
>
>
> Could you help me ti understand it?
>
>
>
> Thanks a lot
>
> Mario
>


[jira] [Commented] (CONNECTORS-1516) Class not found exception using Tika transformer

2018-07-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558285#comment-16558285
 ] 

Karl Wright commented on CONNECTORS-1516:
-

Fix committed in Apache POI.  But now we see:

{code}
FATAL 2018-07-26T11:30:32,220 (Worker thread '28') - Error tossed: 
org/apache/poi/POIXMLTextExtractor

java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor

at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) 
~[?:?]

at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]

at 
org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
 ~[?:?]

at 
org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
 ~[?:?]
{code}

New Tika bug report: TIKA-2693



> Class not found exception using Tika transformer
> 
>
> Key: CONNECTORS-1516
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1516
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.10
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
> See Bugzilla case: https://bz.apache.org/bugzilla/show_bug.cgi?id=62564



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2693) Tika 1.17 uses the wrong classloader for reflection

2018-07-26 Thread Karl Wright (JIRA)
Karl Wright created TIKA-2693:
-

 Summary: Tika 1.17 uses the wrong classloader for reflection
 Key: TIKA-2693
 URL: https://issues.apache.org/jira/browse/TIKA-2693
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.17
Reporter: Karl Wright


I don't know whether this was addressed in 1.18, but Tika seemingly uses the 
wrong classloader when loading some classes by reflection.

In ManifoldCF, there's a two-tiered classloader hierarchy.  Tika runs in the 
higher class level.  Its expectation is that classes that are loaded via 
reflection use the classloader associated with the class that is resolving the 
reflection, NOT the thread classloader.  That's standard Java practice.

But apparently there's a place where Tika doesn't do it that way:

{code}
Error tossed: org/apache/poi/POIXMLTextExtractor
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) 
~[?:?]
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
at 
org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
 ~[?:?]
{code}
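
To illustrate the distinction (a generic example only, not the actual Tika code; 
the class name is just the one from the trace above):

{code}
// Resolves against the classloader of the class doing the reflection --
// in the two-tiered setup this is the loader that can actually see the POI jars:
Class<?> viaOwningLoader = Class.forName("org.apache.poi.POIXMLTextExtractor",
  true, OOXMLParser.class.getClassLoader());

// Resolves against the thread context classloader, which in ManifoldCF's
// layout may be the parent loader and therefore cannot see the POI jars:
Class<?> viaThreadLoader = Class.forName("org.apache.poi.POIXMLTextExtractor",
  true, Thread.currentThread().getContextClassLoader());
{code}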




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
Hi Maxence,

I am wondering whether you moved any jars from dist/connector-common-lib to
dist/lib?  If you did this, you will mess up the ability of any of the Tika
jars to find their dependencies.  This also explains why commons-compress
cannot be found; it's in connector-common-lib.  It sounds like you may have
put the new poi jars in the wrong place?  They should *all* be in
connector-common-lib too.

Karl


On Thu, Jul 26, 2018 at 6:23 AM Karl Wright  wrote:

> Hi Maxence,
>
> The following error:
>
> >>>>>>
>
> FATAL 2018-07-26T11:30:32,220 (Worker thread '28') - Error tossed:
> org/apache/poi/POIXMLTextExtractor
>
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
>
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
> ~[?:?]
>
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> <<<<<<
>
>  seems to be the result of putting new POI jars down that are not
> compatible fully with the version of Tika that's there.  Unfortunately,
> this cannot be addressed right now in any way I can think of.  Tika's
> dependencies are legion and they change all the time.
>
> The only thing we can really do is wait for: (1) POI to release their new
> software, and then (2) Tika to release a new release that depends on it.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 5:33 AM msaunier  wrote:
>
>> Hello Karl,
>>
>>
>>
>> For the moment, it working.
>>
>>
>>
>> I have write this errors but they are not FATAL:
>>
>>
>>
>> DEBUG 2018-07-26T11:30:32,220 (Worker thread '4') - JCIFS: Checking '*'
>> against '/69B_citya_barioz_immobilier/02894_berthollier/Formation/'
>>
>> DEBUG 2018-07-26T11:30:32,220 (Worker thread '4') - JCIFS: Match found.
>>
>> DEBUG 2018-07-26T11:30:32,220 (Worker thread '4') - JCIFS: Leaving
>> checkInclude for
>> 'smb://srv-fichiersqg/Social/_SOCIAL_CABINETS/69B_citya_barioz_immobilier/02894_berthollier/Formation/'
>>
>> DEBUG 2018-07-26T11:30:32,220 (Worker thread '4') - JCIFS: Recorded path
>> is
>> 'smb://srv-fichiersqg/Social/_SOCIAL_CABINETS/69B_citya_barioz_immobilier/02894_berthollier/Formation/'
>> and is included.
>>
>> FATAL 2018-07-26T11:30:32,220 (Worker thread '28') - Error tossed:
>> org/apache/poi/POIXMLTextExtractor
>>
>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor
>>
>> at
>> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.contentlimiter.ContentLimiter.addOrReplaceDocumentWithException(ContentLimiter.java:161)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:322

Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
che.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> Caused by: java.lang.ClassNotFoundException:
> org.apache.poi.POIXMLTextExtractor
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> ~[?:1.8.0_171]
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> ~[?:1.8.0_171]
>
> at
> java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
> ~[?:1.8.0_171]
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ~[?:1.8.0_171]
>
> ... 18 more
>
> AND
>
>
>
> Starting crawler...
>
> juil. 26, 2018 11:29:01 AM
> org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
>
> AVERTISSEMENT: JBIG2ImageReader not loaded. jbig2 files will be ignored
>
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
>
> for optional dependencies.
>
> TIFFImageWriter not loaded. tiff files will not be processed
>
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
>
> for optional dependencies.
>
> J2KImageReader not loaded. JPEG2000 files will not be processed.
>
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
>
> for optional dependencies.
>
>
>
> juil. 26, 2018 11:29:01 AM
> org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
>
> AVERTISSEMENT: org.xerial's sqlite-jdbc is not loaded.
>
> Please provide the jar on your classpath to parse sqlite files.
>
> See tika-parsers/pom.xml for the correct version.
>
>
>
> Maxence,
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Wednesday, July 25, 2018 19:09
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Out of memory, one file bug i think
>
>
>
> That's what I was afraid of.  The new poi jars have dependencies we
> haven't accounted for yet.
>
>
>
> Can you download apache-commons-compress jar (latest version should be OK)
> and also put that in connector-common-lib?  Thanks!!
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>
> Hi Karl,
>
>
>
> I have add the snapshot and I’m spam with this error :
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.I

Re: Create a new ACTIVITY_FETCH from a transformation

2018-07-26 Thread Karl Wright
ManifoldCF has the concept of "compound document", but all the independent
"components" of the document must be identified at the root level (that is,
in the Repository Connector).

I'm therefore afraid there is no good mapping from ManifoldCF concepts to
what you want to do without writing your own Repository Connector.
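
That said, if you do end up writing one, the key call is small: inside your
connector's processDocuments() you can queue any URL you discover as a new
document reference, and the framework will fetch and process it like any other
document.  Very rough sketch (check IProcessActivity for the exact
addDocumentReference signature you want; extractRedirectParameter below stands
in for your own parsing of the "?redirectURL=" parameter):

// inside the processDocuments() loop, after handling the document itself:
String redirectURL = extractRedirectParameter(documentIdentifier);
if (redirectURL != null)
  activities.addDocumentReference(redirectURL, documentIdentifier, null);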

Karl


On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez 
wrote:

> Hi Karl,
>
> I made a quick picture of what I really need (attached)
>
>  Certain URLs coming from repository could be split into two: URL1 and
> URL2.
>
> Normal flow acts as only one is present, URL, but writing a new transform
> I could realise also that there is another one: URL2.
> My complain now is: "well, I have URL2 , how can then inject it to the
> flow in order to become a new URL from the repository (and then fetched,
> processed and ingested like others do)?".
>
> Thanks.
>
>
>
> El jue., 26 jul. 2018 a las 0:35, Karl Wright ()
> escribió:
>
>> The crawled URL is transmitted as part of the RepositoryDocument object to
>> the output connector.  If this is going to Solr, it's used as the
>> document's ID.  You can therefore customize Solr (or ElasticSearch) to
>> extract the data you need at the indexing end.
>>
>> If this doesn't make any sense to you, then please be more specific about
>> what the disposition of each crawled document is.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <
>> gustavo.benei...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > I need to extract and analyse crawled urls because they may contain
>> certain
>> > parameters such as "?redirectURL=" that could point to new Documents to
>> be
>> > fetched and indexed.
>> >
>> > First I was trying to create a subclass that extends
>> >
>> > public class RedirectExtractor extends
>> > org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
>> >
>> > and add a "RedirectExtractor" transformation step to the fetch process
>> in
>> > ManifoldCF, but it only allows me to modify current Document, not to
>> create
>> > a new FETCH from the extracted parameter.
>> >
>> > I was investigating manifoldCF source code and I found something that
>> may
>> > be in hand
>> >
>> > activities.recordActivity(null,ACTIVITY_FETCH,
>> > null,urlValue,Integer.toString(-2),"Robots
>> > exclusion",null);
>> >
>> > from the IProcessActivity interface, which is used by the Connectors. I
>> > didn't want to create a new connector since it is a bit complex but, do
>> you
>> > see an alternative or this is the only way?
>> >
>> > Thanks in advance.
>> >
>>
>


Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Here's the documentation from HttpClient on the various cookie policies.
You're probably going to need to read some of the RFCs to see which policy
you want.  I will wait for you to get back to me with a recommendation
before taking any action in the MCF codebase.  Thanks!

https://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html

Karl


On Thu, Jul 26, 2018 at 3:19 AM Karl Wright  wrote:

> Ok, so the database for your site crawl contains both z.com and x.y.z.com
> cookies?  And your site pages from domain a.y.z.com receive no cookies at
> all when fetched?  Is that a correct description of the situation?
>
> Please verify that the a.y.z.com pages are part of the protected part of
> your "site".  The regular expression that describes site membership for the
> login sequence you are trying to set up must include them or they will not
> receive any cookies no matter what we do.
>
> If this is set up correctly, then the only explanation is the HttpClient
> cookie policy in effect for site fetches.  It does not look like we
> override the cookie policy anywhere when setting up the client:
>
> PoolingHttpClientConnectionManager poolingConnManager = new
> PoolingHttpClientConnectionManager(RegistryBuilder.create()
>   .register("http",
> PlainConnectionSocketFactory.getSocketFactory())
>   .register("https", myFactory)
>   .build());
> poolingConnManager.setDefaultMaxPerRoute(1);
> poolingConnManager.setValidateAfterInactivity(2000);
> poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
>   .setTcpNoDelay(true)
>   .setSoTimeout(socketTimeoutMilliseconds)
>   .build());
> connManager = poolingConnManager;
>   }
>
>
> HttpClient tends to default to "strict" when stuff is not specified.  I'll
> see if I can find out what the behavior is.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 2:29 AM Gustavo Beneitez <
> gustavo.benei...@gmail.com> wrote:
>
>> Hi,
>>
>> database may contain Z.com and X.Y.Z.com if created automatically
>> through a JSP, but not the intermediate one Y.Z.com.
>>
>> if the crawler decides to go to A.Y.Z.com and looking to database Z.com
>> is present, it still doesn't work (it should since A.Y.Z is a sub-domain in
>> Z).
>>
>> Only doing that changes by hand (replacing domain with sub-domain in
>> database) and restarting manifold it begins to work.
>>
>> There might be security constrains somehow, I will consider further
>> analysis.
>>
>> Regards.
>>
>>
>> El jue., 26 jul. 2018 a las 0:06, Karl Wright ()
>> escribió:
>>
>>> The web connector, though, does not filter any cookies.  It takes them
>>> all -- whatever cookies HttpClient is storing at that point.  So you should
>>> see all the cookies in the database table, regardless of their site
>>> affinity, unless HttpClient is refusing to accept a cookie for security
>>> reasons.
>>>
>>> It's also possible that HttpClient is selective about which cookies to
>>> transmit on a page fetch.
>>>
>>> Can you look in the database and tell me whether your cookie gets
>>> stored, or not?  If not, then HttpClient's cookie acceptance policy is not
>>> lenient enough.  If it is in the database, then it's the transmission
>>> policy that is too strict.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <
>>> gustavo.benei...@gmail.com> wrote:
>>>
>>>> I agree, but the fact is that if my "login sequence" defines a login
>>>> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or "
>>>> X.Y.Z.com", none of the sub-sites receives that cookie; I need to
>>>> write the same cookie for every sub-domain, which solves the situation (and
>>>> thankfully it is a language cookie and not a dynamic one).
>>>>
>>>> Regards.
>>>>
>>>> El mié., 25 jul. 2018 a las 19:17, Karl Wright ()
>>>> escribió:
>>>>
>>>>> You should not need to fill the database by hand.  Your login sequence
>>>>> should include whatever redirection etc is used to set the cookies though.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>>>>> gustavo.benei...@gmail.com> wrote:
>>>>>
>>>>>> Hi again,
>

Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Ok, so the database for your site crawl contains both z.com and x.y.z.com
cookies?  And your site pages from domain a.y.z.com receive no cookies at
all when fetched?  Is that a correct description of the situation?

Please verify that the a.y.z.com pages are part of the protected part of
your "site".  The regular expression that describes site membership for the
login sequence you are trying to set up must include them or they will not
receive any cookies no matter what we do.

If this is set up correctly, then the only explanation is the HttpClient
cookie policy in effect for site fetches.  It does not look like we
override the cookie policy anywhere when setting up the client:

PoolingHttpClientConnectionManager poolingConnManager = new
PoolingHttpClientConnectionManager(RegistryBuilder.create()
  .register("http", PlainConnectionSocketFactory.getSocketFactory())
  .register("https", myFactory)
  .build());
poolingConnManager.setDefaultMaxPerRoute(1);
poolingConnManager.setValidateAfterInactivity(2000);
poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
  .setTcpNoDelay(true)
  .setSoTimeout(socketTimeoutMilliseconds)
  .build());
connManager = poolingConnManager;
  }


HttpClient tends to default to "strict" when stuff is not specified.  I'll
see if I can find out what the behavior is.

Karl


On Thu, Jul 26, 2018 at 2:29 AM Gustavo Beneitez 
wrote:

> Hi,
>
> the database may contain Z.com and X.Y.Z.com if created automatically through
> a JSP, but not the intermediate one, Y.Z.com.
>
> If the crawler goes to A.Y.Z.com while Z.com is present in the database,
> it still doesn't work (it should, since A.Y.Z is a sub-domain of Z).
>
> Only after making those changes by hand (replacing the domain with the
> sub-domain in the database) and restarting Manifold does it begin to work.
>
> There might be security constraints somehow; I will consider further
> analysis.
>
> Regards.
>
>
> El jue., 26 jul. 2018 a las 0:06, Karl Wright ()
> escribió:
>
>> The web connector, though, does not filter any cookies.  It takes them
>> all -- whatever cookies HttpClient is storing at that point.  So you should
>> see all the cookies in the database table, regardless of their site
>> affinity, unless HttpClient is refusing to accept a cookie for security
>> reasons.
>>
>> It's also possible that HttpClient is selective about which cookies to
>> transmit on a page fetch.
>>
>> Can you look in the database and tell me whether your cookie gets stored,
>> or not?  If not, then HttpClient's cookie acceptance policy is not lenient
>> enough.  If it is in the database, then it's the transmission policy that
>> is too strict.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <
>> gustavo.benei...@gmail.com> wrote:
>>
>>> I agree, but the fact is that if my "login sequence" defines a login
>>> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or "
>>> X.Y.Z.com", none of the sub-sites receives that cookie; I need to write
>>> the same cookie for every sub-domain, which solves the situation (and
>>> thankfully it is a language cookie and not a dynamic one).
>>>
>>> Regards.
>>>
>>> El mié., 25 jul. 2018 a las 19:17, Karl Wright ()
>>> escribió:
>>>
>>>> You should not need to fill the database by hand.  Your login sequence
>>>> should include whatever redirection etc is used to set the cookies though.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>>>> gustavo.benei...@gmail.com> wrote:
>>>>
>>>>> Hi again,
>>>>>
> >>>>> Thanks Karl, I was able to do that after defining some "login
> >>>>> sequence", but also after filling the database (cookiedata table) with certain
> >>>>> values due to "domain constraints".
> >>>>> Before every web call, I suspect Manifold only takes cookies from the URL's
> >>>>> exact subdomain (i.e. x.y.z.com), so if you define your cookie as "
> >>>>> z.com" it won't be sent, so I added every subdomain by hand and it
> >>>>> started to work.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>> El vie., 20 jul. 2018 a las 8:12, Gustavo Beneitez (<
>>>>> gustavo.benei...@gmail.com>) escribió:
>>>>>
>>>>>> Hi

[jira] [Commented] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-07-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556937#comment-16556937
 ] 

Karl Wright commented on CONNECTORS-1517:
-

[~jamesthomas], the connector was developed under the assumption that the 
content types obtained from Documentum and presented in the UI was complete, 
and that all documents would have one of the described content types.  If you 
want a special checkbox for documents that do NOT have one of the described 
content types, I will need to know the DQL that would match those and only 
those.  Thanks.


> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list are treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-07-25 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1517:

Fix Version/s: ManifoldCF 2.11

> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list are treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1517) Documentum Connector uses different "unconstrained" a_content_type filters depending on whether the Content Types tab has been edited

2018-07-25 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1517:
---

Assignee: Karl Wright

> Documentum Connector uses different "unconstrained" a_content_type filters 
> depending on whether the Content Types tab has been edited
> -
>
> Key: CONNECTORS-1517
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1517
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
>
> I am using Manifold 2.10 patched for issue 
> https://issues.apache.org/jira/browse/CONNECTORS-1512
> I find that the "unconstrained" query submitted to Documentum differs 
> depending on whether the Content Types in the job have been edited or not. 
> This can dramatically affect which files are fetched. After editing, there 
> are likely to be fewer.
> For example, having simply created a job connecting to DM and setting only 
> the Paths value to Administrator/james the following request is generated. 
> (Taken from manifoldcf.log).
> Note that there are no a_content_type constraints (and my line break for 
> readability):
> {code:java}
> DEBUG 2018-07-26T05:52:56,422 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:52:56','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0))
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> Once the Content Types tab has been edited (e.g. to remove the 123w type) it 
> looks like this, i.e. the search constrains to only the selected types (my 
> ellipsis for readability):
> {code:java}
> DEBUG 2018-07-26T05:58:36,755 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:58:36','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('acad', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> If the 123w type is now reselected in the Content Types tab, the search adds 
> it to the list of a_content_type entries, but doesn't return to the 
> unconstrained initial search:
> {code:java}
> DEBUG 2018-07-26T05:59:16,863 (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 01:00:00','mm/dd/ hh:mi:ss') and 
> r_modify_date<=date('07/26/2018 05:59:16','mm/dd/ hh:mi:ss') AND 
> (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0 
> AND a_content_type IN ('123w', ... 'zip_pub_html'))) 
> AND ( Folder('/Administrator/james', DESCEND) ))
> {code}
> This means that running what appears to be an equivalent job several times 
> may not fetch the same set of documents from Documentum.
> I expect that the same configuration in the UI produces the same search to 
> Documentum, regardless of how the configuration was arrived at.
> If the selected items in the Content Types list are treated as the only set of 
> files to fetch (i.e. the initial unconstrained search is considered 
> incorrect here) then I guess I might also like to have flexibility to fetch 
> file types not on the checklist in the Content Types tab.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Create a new ACTIVITY_FETCH from a transformation

2018-07-25 Thread Karl Wright
The crawled URL is transmitted as part of the RepositoryDocument object to
the output connector.  If this is going to Solr, it's used as the
document's ID.  You can therefore customize Solr (or ElasticSearch) to
extract the data you need at the indexing end.

If this doesn't make any sense to you, then please be more specific about
what the disposition of each crawled document is.

Thanks,
Karl


On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez 
wrote:

> Hi all,
>
> I need to extract and analyse crawled urls because they may contain certain
> parameters such as "?redirectURL=" that could point to new Documents to be
> fetched and indexed.
>
> First I was trying to create a subclass that extends
>
> public class RedirectExtractor extends
> org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
>
> and add a "RedirectExtractor" transformation step to the fetch process in
> ManifoldCF, but it only allows me to modify current Document, not to create
> a new FETCH from the extracted parameter.
>
> I was investigating manifoldCF source code and I found something that may
> be in hand
>
> activities.recordActivity(null,ACTIVITY_FETCH,
> null,urlValue,Integer.toString(-2),"Robots
> exclusion",null);
>
> from the IProcessActivity interface, which is used by the Connectors. I
> didn't want to create a new connector since it is a bit complex but, do you
> see an alternative or this is the only way?
>
> Thanks in advance.
>
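
(Illustrative aside, not part of the thread's code: extracting the redirect
target from a crawled URL is plain JDK string work; what a transformation
connector cannot do, as the question above notes, is feed the result back into
the crawl queue -- it can only modify the current document.  The parameter name
"redirectURL" is the hypothetical one from the question.)

import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

public class RedirectUrlParser
{
  // Returns the decoded value of the "redirectURL" query parameter, or null if absent.
  public static String extractRedirectTarget(final String documentURI)
    throws URISyntaxException, UnsupportedEncodingException
  {
    final String query = new URI(documentURI).getRawQuery();
    if (query == null)
      return null;
    for (final String pair : query.split("&"))
    {
      final int eq = pair.indexOf('=');
      if (eq > 0 && "redirectURL".equals(URLDecoder.decode(pair.substring(0, eq), "UTF-8")))
        return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
    }
    return null;
  }
}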


Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
The web connector, though, does not filter any cookies.  It takes them all
-- whatever cookies HttpClient is storing at that point.  So you should see
all the cookies in the database table, regardless of their site affinity,
unless HttpClient is refusing to accept a cookie for security reasons.

It's also possible that HttpClient is selective about which cookies to
transmit on a page fetch.

Can you look in the database and tell me whether your cookie gets stored,
or not?  If not, then HttpClient's cookie acceptance policy is not lenient
enough.  If it is in the database, then it's the transmission policy that
is too strict.

Thanks,
Karl


On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez 
wrote:

> I agree, but the fact is that if my "login sequence" defines a login
> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or "
> X.Y.Z.com", none of the sub-sites receives that cookie; I need to write
> the same cookie for every sub-domain, which solves the situation (and
> thankfully it is a language cookie and not a dynamic one).
>
> Regards.
>
> El mié., 25 jul. 2018 a las 19:17, Karl Wright ()
> escribió:
>
>> You should not need to fill the database by hand.  Your login sequence
>> should include whatever redirection etc is used to set the cookies though.
>>
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>> gustavo.benei...@gmail.com> wrote:
>>
>>> Hi again,
>>>
>>> Thanks Karl, I was able to do that after defining some "login
>>> sequence", but also after filling the database (cookiedata table) with certain
>>> values due to "domain constraints".
>>> Before every web call, I suspect Manifold only takes cookies from the URL's
>>> exact subdomain (i.e. x.y.z.com), so if you define your cookie as "z.com"
>>> it won't be sent, so I added every subdomain by hand and it started to work.
>>>
>>> Regards.
>>>
>>>
>>> El vie., 20 jul. 2018 a las 8:12, Gustavo Beneitez (<
>>> gustavo.benei...@gmail.com>) escribió:
>>>
>>>> Hi,
>>>>
>>>> thanks a lot, please let me check then the documentation for an example
>>>> of that.
>>>>
>>>> Regards!
>>>>
>>>> El jue., 19 jul. 2018 a las 21:54, Karl Wright ()
>>>> escribió:
>>>>
>>>>> You are correct that cookies are not shared among threads.  That is by
>>>>> design.
>>>>>
>>>>> The only way to set cookies for the WebConnector is to have there be a
>>>>> "login sequence".  The login sequence sets cookies that are then used by
>>>>> all subsequent fetches.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>>>>> gustavo.benei...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have tried to look for an answer before writing this email, no
>>>>>> luck. Sorry for the inconvenience if it is already answered.
>>>>>>
>>>>>> I need to set a cookie at the beginning of the web crawling. The
>>>>>> cookie determines the language in which the content is served, and while
>>>>>> there are several
>>>>>> choices, if no cookie is found there will be a "default language".
>>>>>>
>>>>>> I made a JSP which sets the cookie and contains several links (href),
>>>>>> and pointed ManifoldCF to this page as the repository seed. I expected the
>>>>>> crawling engine to start capturing links in the correct language indicated
>>>>>> by the cookie, but what I really got was a lot of content shown in the
>>>>>> default language.
>>>>>>
>>>>>> What I think is that cookies are not shared between crawler threads, so
>>>>>> cookies do not persist between links. The cookie
>>>>>> domain is correct, and so is the cookie expiration.
>>>>>>
>>>>>> I would very much appreciate it if you could help me with this.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>>
>>>>>>


Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
The "cleaning up" phase deletes the documents in the target index (where
your output connectors point).  That takes more time.

Karl


On Wed, Jul 25, 2018 at 1:43 PM msaunier  wrote:

> If I delete a job in ManifoldCF, the job goes into « Cleaning Up » status.
>
>
>
> « Processed » documents are deleted very fast,
>
> « Active » documents too.
>
> But for « Documents » in the interface, it’s very slow to delete all the lines.
> ManifoldCF deletes documents 100 by 100.
>
>
>
> Maxence,
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddy...@gmail.com]
> *Envoyé :* mercredi 25 juillet 2018 19:18
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Speed up cleaning up job
>
>
>
> I'm sorry, I don't understand your question?
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 12:53 PM msaunier  wrote:
>
> Hi Karl,
>
>
>
> Can I configure ManifoldCF to clean up faster? I think ManifoldCF
> cleans 100 by 100 by default.
>
>
>
> Maxence,
>
>
>
>
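
(Side note, offered as an assumption to verify rather than a confirmed fix: the
cleanup/delete parallelism is governed by properties in properties.xml, so raising
those thread counts is the usual first thing to try.  The exact property names and
defaults should be checked against the properties reference for your ManifoldCF
version; the values below are only a sketch.)

  <property name="org.apache.manifoldcf.crawler.cleanupthreads" value="30"/>
  <property name="org.apache.manifoldcf.crawler.deletethreads" value="30"/>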


Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
I'm sorry, I don't understand your question?

Karl


On Wed, Jul 25, 2018 at 12:53 PM msaunier  wrote:

> Hi Karl,
>
>
>
> Can I configure ManifoldCF to clean up faster? I think ManifoldCF
> cleans 100 by 100 by default.
>
>
>
> Maxence,
>
>
>


Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
You should not need to fill the database by hand.  Your login sequence
should include whatever redirection etc is used to set the cookies though.

Karl


On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez 
wrote:

> Hi again,
>
> Thanks Karl, I was able to do that after defining some "login
> sequence", but also after filling the database (cookiedata table) with certain
> values due to "domain constraints".
> Before every web call, I suspect Manifold only takes cookies from the URL's
> exact subdomain (i.e. x.y.z.com), so if you define your cookie as "z.com"
> it won't be sent, so I added every subdomain by hand and it started to work.
>
> Regards.
>
>
> El vie., 20 jul. 2018 a las 8:12, Gustavo Beneitez (<
> gustavo.benei...@gmail.com>) escribió:
>
>> Hi,
>>
>> thanks a lot, please let me check then the documentation for an example
>> of that.
>>
>> Regards!
>>
>> El jue., 19 jul. 2018 a las 21:54, Karl Wright ()
>> escribió:
>>
>>> You are correct that cookies are not shared among threads.  That is by
>>> design.
>>>
>>> The only way to set cookies for the WebConnector is to have there be a
>>> "login sequence".  The login sequence sets cookies that are then used by
>>> all subsequent fetches.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>>> gustavo.benei...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I have tried to look for an answer before writing this email, no luck.
>>>> Sorry for the inconvenience if it is already answered.
>>>>
>>>> I need to set a cookie at the beginning of the web crawling. The cookie
>>>> determines the language in which the content is served, and while there are
>>>> several choices, if no cookie is found there will be a "default language".
>>>>
>>>> I made a JSP which sets the cookie and contains several links (href),
>>>> and pointed ManifoldCF to this page as the repository seed. I expected the
>>>> crawling engine to start capturing links in the correct language indicated by
>>>> the cookie, but what I really got was a lot of content shown in the
>>>> default language.
>>>>
>>>> What I think is that cookies are not shared between crawler threads, so
>>>> cookies do not persist between links. The cookie
>>>> domain is correct, and so is the cookie expiration.
>>>>
>>>> I would very much appreciate it if you could help me with this.
>>>>
>>>> Thanks in advance!
>>>>
>>>>
>>>>
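
(Illustrative aside: on the server side, giving the language cookie an explicit
parent-domain attribute in the seed JSP is the usual way to make one cookie
visible to every sub-domain.  A minimal Servlet-API sketch follows; the cookie
name, value and domain are placeholders taken from this discussion, and whether
the crawler replays the cookie for a.y.z.com still depends on HttpClient's
cookie policy as discussed above.)

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class LanguageCookieHelper
{
  // Sets the language cookie so that it applies to z.com and all of its sub-domains.
  public static void setLanguageCookie(final HttpServletResponse response)
  {
    final Cookie cookie = new Cookie("language", "en");
    cookie.setDomain(".z.com");      // leading dot: valid for the domain and its sub-domains
    cookie.setPath("/");
    cookie.setMaxAge(60 * 60 * 24);  // one day, in seconds
    response.addCookie(cookie);
  }
}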


***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
It looks like you are still running out of memory.  I would love to know
what document it was that was doing that.  I suspect it is very large already,
and for some reason it cannot be streamed.

Karl
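
(Generic aside, not ManifoldCF's own mechanism: when a single oversized document
is the suspect, one common defence is to cap how many bytes the extraction stage
will read before giving up on that document, so that one file cannot take the
whole JVM down.  A JDK-only sketch of the idea:)

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SizeCappedInputStream extends FilterInputStream
{
  private final long maxBytes;
  private long seen = 0L;

  public SizeCappedInputStream(final InputStream in, final long maxBytes)
  {
    super(in);
    this.maxBytes = maxBytes;
  }

  @Override
  public int read() throws IOException
  {
    final int b = super.read();
    if (b >= 0)
      count(1);
    return b;
  }

  @Override
  public int read(final byte[] buf, final int off, final int len) throws IOException
  {
    final int n = super.read(buf, off, len);
    if (n > 0)
      count(n);
    return n;
  }

  // Fail fast with an ordinary IOException once the byte budget is exceeded.
  private void count(final long n) throws IOException
  {
    seen += n;
    if (seen > maxBytes)
      throw new IOException("Document exceeds the size limit of " + maxBytes + " bytes");
  }
}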


On Wed, Jul 25, 2018 at 1:13 PM Karl Wright  wrote:

> Hi Maxence,
>
> The second exception is occurring because processing is still occurring
> while the JVM is shutting down; it can be ignored.
>
> Karl
>
>
> On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:
>
>> Hi Karl,
>>
>>
>>
>> I have added the snapshot and I’m being spammed with this error:
>>
>>
>>
>> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
>> org/apache/commons/compress/utils/InputStreamStatistics
>>
>> java.lang.NoClassDefFoundError:
>> org/apache/commons/compress/utils/InputStreamStatistics
>>
>> at
>> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
>> ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>>
>> at
>> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
>> ~[?:?]
>>
>> at
>> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>> ~[?:?]
>>
>> at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>> ~[mcf-agents.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
>> ~[mcf-pull-agent.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
>> ~[mcf-pull-agent.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
>> ~[?:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>>
>>
>>
>> Maxence,
>>
>>
>>
>>
>>
>> *De :* Karl Wright [mailto:daddy...@gmail.com]
>> *Envoyé :* mercredi 25 juillet 2018 13:12
>> *À :* user@manifoldcf.apache.org
>> *Objet :* Re: Out of memory, one file bug i think
>>
>>
>>
>> Hi Maxence,
>>
>>
>>
>> Tomorrow (7/26) the POI project will be delivering a nightly build which
>> should repair the Class Not Found exceptions.  You will need to download it
>> here:
>>
>>
>> https://builds.apache.org/view/P/view/POI/job/POI-DSL-1.8

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
Hi Maxence,

The second exception is occurring because processing is still occurring
while the JVM is shutting down; it can be ignored.

Karl


On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:

> Hi Karl,
>
>
>
> I have added the snapshot and I’m being spammed with this error:
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> Maxence,
>
>
>
>
>
> *De :* Karl Wright [mailto:daddy...@gmail.com]
> *Envoyé :* mercredi 25 juillet 2018 13:12
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Out of memory, one file bug i think
>
>
>
> Hi Maxence,
>
>
>
> Tomorrow (7/26) the POI project will be delivering a nightly build which
> should repair the Class Not Found exceptions.  You will need to download it
> here:
>
>
> https://builds.apache.org/view/P/view/POI/job/POI-DSL-1.8/lastSuccessfulBuild/artifact/build/dist/
>
>
>
> ... and replace all poi jars with the corresponding ones from the binary
> distribution.  I believe the poi jars are all in connector-common-lib.  Be
> sure to delete the old ones (or move them somewhere else) first.
>
>
>
> I don't know whether this will fix your out of memory problem however.
> Please let me know what's still not working and I can take it from there.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 6:03 AM Karl Wright  wrote:
>
> Out of memory errors are fatal, I'm afraid, because they corrupt not only
> the document in question but all others being processed at the same time.
> So those cannot be ignored.
>
>
>
> Tika should ignore documents th

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
That's what I was afraid of.  The new poi jars have dependencies we haven't
accounted for yet.

Can you download apache-commons-compress jar (latest version should be OK)
and also put that in connector-common-lib?  Thanks!!

Karl
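
(Quick diagnostic, generic JDK code rather than anything ManifoldCF-specific:
once the commons-compress jar is in place, this prints which jar actually
supplies the class named in the NoClassDefFoundError.  Run it with the same lib
directories on the classpath that the agents process uses.)

public class ClasspathCheck
{
  public static void main(final String[] args) throws Exception
  {
    final String name = args.length > 0 ? args[0]
      : "org.apache.commons.compress.utils.InputStreamStatistics";
    final Class<?> clazz = Class.forName(name);
    // getCodeSource() can be null for JDK bootstrap classes, but not for a library jar.
    System.out.println(name + " loaded from "
      + clazz.getProtectionDomain().getCodeSource().getLocation());
  }
}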


On Wed, Jul 25, 2018 at 1:01 PM msaunier  wrote:

> Hi Karl,
>
>
>
> I have added the snapshot and I’m being spammed with this error:
>
>
>
> FATAL 2018-07-25T16:43:04,599 (Worker thread '0') - Error tossed:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> java.lang.NoClassDefFoundError:
> org/apache/commons/compress/utils/InputStreamStatistics
>
> at
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:62)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:255)
> ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725) ~[?:?]
>
> at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238) ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:197)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:127)
> ~[?:?]
>
> at
> org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
> ~[?:?]
>
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> ~[?:?]
>
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> ~[?:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> Maxence,
>
>
>
>
>
> *De :* Karl Wright [mailto:daddy...@gmail.com]
> *Envoyé :* mercredi 25 juillet 2018 13:12
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Out of memory, one file bug i think
>
>
>
> Hi Maxence,
>
>
>
> Tomorrow (7/26) the POI project will be delivering a nightly build which
> should repair the Class Not Found exceptions.  You will need to download it
> here:
>
>
> https://builds.apache.org/view/P/view/POI/job/POI-DSL-1.8/lastSuccessfulBuild/artifact/build/dist/
>
>
>
> ... and replace all poi jars with the corresponding ones from the binary
> distribution.  I believe the poi jars are all in connector-common-lib.  Be
> sure to delete the old ones (or move them somewhere else) first.
>
>
>
> I don't know whether this will fix your out of memory problem however.
> Please let me know what's still not working and I can take it from there.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jul 25, 2018 at 6:03 AM Karl Wright  wrote:
>
> Out of memory errors are fatal, I'm afraid, because they corrupt not only
> the document in question but all others being processed at the 
