[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: (was: manifoldcf.log.reduced)

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls a website and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720982#comment-16720982
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Attached the "reduced" step with query logging.  Analysis will take some time.  
The entire startup log chunk is here (and it contains the seeding part, which 
is what we're interested in):

{code}
DEBUG 2018-12-14T01:07:42,367 (Startup thread) - Requested query: [UPDATE 
jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND 
status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Done actual query (0ms): [SET 
SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Actual query: [UPDATE jobqueue SET 
needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND status=? OR 
jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 0: 'T'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 1: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 2: 'P'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 4: 'G'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Done actual query (0ms): [UPDATE 
jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND 
status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Beginning transaction of type 2
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Marking for delete for job 
1544121003866 all hopcount document references from table jobqueue t99 matching 
t99.status IN (?,?)
DEBUG 2018-12-14T01:07:42,370 (Startup thread) - Requested query: [UPDATE 
hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM 
hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? 
AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Done actual query (0ms): [SET 
SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Actual query: [UPDATE hopcount SET 
distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM hopdeletedeps 
t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? AND 
t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 0: '-1'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 1: 'D'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 2: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 4: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 5: 'P'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 6: 'H'
DEBUG 2018-12-14T01:07:42,389 (Thread-691) - Done actual query (18ms): [UPDATE 
hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM 
hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? 
AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Done setting hopcount rows for 
job 1544121003866 to initial distances
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Requested query: [DELETE FROM 
intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND 
t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Thread-692) - Actual query: [DELETE FROM 
intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND 
t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 0: '1544121003866'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 1: 'P'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 2: 'H'
DEBUG 2018-12-14T01:07:42,407 (Thread-692) - Done actual query (17ms): [DELETE 
FROM intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? 
AND t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,408 (Startup thread) - Requested query: [DELETE FROM 
hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND 
deathmark=?))]
DEBUG 2018-12-14T01:07:42,408 (Thread-693) - Actual query: [DELETE FROM 
hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND 
deathmark=?))]
DEBUG 2018-12-14T01:07:42,409
{code}
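The mark-then-sweep ("deathmark") pattern visible in the queries above can be illustrated with a minimal, self-contained sketch. The table shapes below are simplified stand-ins, not ManifoldCF's actual schema: rows are first marked with deathmark='D' and distance=-1, and dependent rows are then deleted by referencing the marked set.

```python
import sqlite3

# Simplified stand-ins for the hopcount and hopdeletedeps tables in the log.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE hopcount (id INTEGER, jobid TEXT, distance INTEGER, deathmark TEXT)")
cur.execute("CREATE TABLE hopdeletedeps (ownerid INTEGER, jobid TEXT)")
cur.executemany("INSERT INTO hopcount VALUES (?,?,?,?)",
                [(1, "job1", 3, ""), (2, "job1", 5, ""), (3, "job2", 1, "")])
cur.executemany("INSERT INTO hopdeletedeps VALUES (?,?)",
                [(1, "job1"), (3, "job2")])

# Step 1: mark the job's hopcount rows, as in the log's
# UPDATE hopcount SET distance=?, deathmark=? ... with parameters -1 and 'D'.
cur.execute("UPDATE hopcount SET distance=?, deathmark=? WHERE jobid=?", (-1, "D", "job1"))

# Step 2: sweep dependency rows owned by marked hopcount rows, as in the log's
# DELETE FROM hopdeletedeps WHERE ownerid IN (SELECT id FROM hopcount WHERE jobid=? AND deathmark=?).
cur.execute("DELETE FROM hopdeletedeps WHERE ownerid IN "
            "(SELECT id FROM hopcount WHERE jobid=? AND deathmark=?)", ("job1", "D"))

remaining = cur.execute("SELECT ownerid FROM hopdeletedeps").fetchall()
print(remaining)  # job1's dependency row was swept; job2's survives: [(3,)]
```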

[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: manifoldcf.log.reduced

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls a website and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720171#comment-16720171
 ] 

Karl Wright commented on CONNECTORS-1563:
-

If you need me to debug your Solr setup, you're going to need to wait a couple 
of weeks, I'm afraid. I'm extremely behind and have honestly been working 
20-hour days. You're on your own for now.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> With the "Use the Extract Update Handler" parameter checked, I get an 
> error on Solr: null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field on Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika 
> extractor in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance.
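For context, Tika raises ZeroByteFileException when it is handed an empty input stream. A common client-side workaround is to skip zero-length documents before posting them to the extract handler; the helper below is a hedged sketch with a hypothetical name, not ManifoldCF's or Solr's actual code:

```python
def should_post_to_extract_handler(content: bytes) -> bool:
    # Tika's ZeroByteFileException is raised for empty input streams,
    # so skip zero-length documents instead of sending them to Solr Cell.
    return len(content) > 0

# Non-empty documents go through; empty ones are skipped.
print(should_post_to_extract_handler(b"hello"))  # True
print(should_post_to_extract_handler(b""))       # False
```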





[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Sneha (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720085#comment-16720085
 ] 

Sneha commented on CONNECTORS-1563:
---

I can't find the missing configuration in Solr. I have attached the 
managed-schema and solrconfig.xml.

Created the core using the command: solr create -c corename

Please help me with the configuration.






[jira] [Updated] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Sneha (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sneha updated CONNECTORS-1563:
--
Attachment: solrconfig.xml
managed-schema






[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720061#comment-16720061
 ] 

Karl Wright commented on CONNECTORS-1563:
-

That argues that your Solr configuration is not correct, because this was 
tested thoroughly during the last release cycle.







[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Sneha (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720018#comment-16720018
 ] 

Sneha commented on CONNECTORS-1563:
---

Upgraded ManifoldCF to version 2.11, but the same error still shows on the 
Solr server.






[jira] [Assigned] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1563:
---

Assignee: Karl Wright






[jira] [Updated] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1563:

Component/s: (was: Solr 7.x component)
 Lucene/SOLR connector






[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719945#comment-16719945
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I went through the invalidation logic, which was last changed in 2012 as part 
of ticket CONNECTORS-501. The fix for that ticket did not seem related to the 
current problem, at least.

I improved the logging and documentation in this area, but have not yet found 
any logical errors. The next step is therefore to repeat the experiment with 
database debugging also enabled so I can see the queries too. I can probably 
start by doing this only during the reduce phase, but if those queries look 
good, I'll have to do the initial phase as well.




