[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763395#comment-16763395
 ] 

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

You can either check out the entire current trunk source code and build that, 
or download the release source and libs, apply the patch, and build that.  
Which do you want to do?


> Error when crawling a MSSQL table
> ---------------------------------
>
>                 Key: CONNECTORS-1579
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: JDBC connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.13
>
>         Attachments: 636_bb2.csv, CONNECTORS-1579.patch
>
>
> When I crawl an MSSQL table through the JDBC connector, I get the following
> error for multiple rows:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
>     at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> {noformat}
> I looked this error up online, and it seems it might be caused by the same
> key being used for different rows.
> I checked, but I couldn't find any duplicates in any of the fields selected
> by the JDBC queries.
> Here are my queries:
> Seeding query:
> {code:sql}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
> Access token query: none
> Data query:
> {code:sql}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV contains the corresponding row from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760852#comment-16760852
 ] 

Donald Van den Driessche commented on CONNECTORS-1579:
------------------------------------------------------

Thanks for clearing this up.



[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760848#comment-16760848
 ] 

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

It's a bug in the code.
Whenever the JDBC connector rejects a document based on what the downstream 
pipeline tells it to do, it improperly accounts for that and you get this 
error.  The fix is quite simple and I can attach a patch, and will do so 
shortly. Thanks!
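
For readers who want a picture of this kind of accounting error, here is a minimal sketch in plain Java. It is not the attached patch and not ManifoldCF code: the class name, the process method, and the removeRejectedFromPending flag are all hypothetical, and the two-phase structure is only an assumption about how a rejected document could end up being reported twice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of a connector's two-phase processing. Names are
// illustrative and do not come from the ManifoldCF source or the patch.
class RejectionAccountingSketch {

  // Returns every disposition reported to the framework as "id:action".
  static List<String> process(Map<String, String> versions,
                              Predicate<String> pipelineAccepts,
                              boolean removeRejectedFromPending) {
    List<String> reported = new ArrayList<>();
    // Documents that have not received a disposition yet.
    Map<String, String> pending = new LinkedHashMap<>(versions);

    // Phase 1: handle each row returned by the data query.
    for (String id : versions.keySet()) {
      if (pipelineAccepts.test(id)) {
        reported.add(id + ":ingest");
        pending.remove(id);
      } else {
        // The downstream pipeline rejected the document.
        reported.add(id + ":noDocument");
        if (removeRejectedFromPending) {
          // Assumed fix: remember that this document is already handled.
          pending.remove(id);
        }
      }
    }

    // Phase 2: anything still pending is treated as missing from the results.
    for (String id : pending.keySet()) {
      reported.add(id + ":noDocument");
    }
    return reported;
  }

  public static void main(String[] args) {
    Map<String, String> versions = new LinkedHashMap<>();
    versions.put("635", "v1");
    versions.put("636", "v1");
    Predicate<String> accepts = id -> !id.equals("636");
    System.out.println(process(versions, accepts, false));
    System.out.println(process(versions, accepts, true));
  }
}
```

With the flag off, document 636 is reported twice, which is the situation the framework refuses; with the flag on, each document is reported exactly once.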




[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760818#comment-16760818
 ] 

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

Hi,

The proximate cause of the problem is that there are multiple "resolutions" 
occurring for one document in the JDBC crawl set.  When a connector is asked to 
process a document, it must tell the framework what is to be done with it -- 
either it gets indexed, or it gets skipped, or it gets deleted.  The problem is 
that the connector is telling the framework TWO things for the same document.  
The code in question:

{code}
// Now, go through the original id's, and see which ones are still in the map.  These
// did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
for (final String documentIdentifier : fetchDocuments)
{
  if (!seenDocuments.contains(documentIdentifier))
  {
    // Never saw it in the fetch attempt
    activities.deleteDocument(documentIdentifier);
  }
  else
  {
    // Saw it in the fetch attempt, and we might have fetched it
    final String documentVersion = map.get(documentIdentifier);
    if (documentVersion != null)
    {
      // This means we did not see it (or data for it) in the result set.  Delete it!
      activities.noDocument(documentIdentifier,documentVersion);
{code}

It's failing on the last line.  The connector thinks the document does not 
exist (based on the version query you gave it), BUT based on the results of 
the other queries, it thinks the document does exist (and was in fact 
processed).

I will need to look carefully at the queries and at the connector code to 
figure out exactly how that can happen, and then I can let you know whether 
it's a bug in the code or a bug in your queries.  Stay tuned.
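
The one-disposition-per-document rule described above can be sketched in a few lines of standalone Java. This is only an illustration: DispositionTracker and its Disposition enum are hypothetical stand-ins, not the framework's WorkerThread implementation, and only the exception message is taken from the log in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the framework's per-document bookkeeping;
// this is not the actual ManifoldCF WorkerThread code.
class DispositionTracker {
  enum Disposition { INDEXED, SKIPPED, DELETED }

  private final Map<String, Disposition> recorded = new HashMap<>();

  // Record the single outcome for a document; a second report is an error.
  void record(String documentIdentifier, Disposition disposition) {
    Disposition previous = recorded.putIfAbsent(documentIdentifier, disposition);
    if (previous != null) {
      throw new IllegalStateException(
          "Multiple document primary component dispositions not allowed: document '"
              + documentIdentifier + "'");
    }
  }

  // Returns the recorded outcome, or null if none was reported.
  Disposition get(String documentIdentifier) {
    return recorded.get(documentIdentifier);
  }
}

class DispositionDemo {
  public static void main(String[] args) {
    DispositionTracker tracker = new DispositionTracker();
    tracker.record("636", DispositionTracker.Disposition.SKIPPED);
    try {
      // A second outcome for the same document is refused.
      tracker.record("636", DispositionTracker.Disposition.DELETED);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The first report for document 636 is accepted and the second one throws, which mirrors the IllegalStateException in the reported stack trace.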

