[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763395#comment-16763395 ]

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

You can either check out the entire current trunk source code and build that, or download the release source and libs, apply the patch, and build that. Which do you want to do?

> Error when crawling a MSSQL table
> ---------------------------------
>
>                 Key: CONNECTORS-1579
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: JDBC connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.13
>
>         Attachments: 636_bb2.csv, CONNECTORS-1579.patch
>
>
> When I'm crawling a MSSQL table through the JDBC connector I get following error on multiple lines:
>
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> {noformat}
> I looked this error up on the internet and it said that it might have something to do with using the same key for different lines.
> I checked, but I couldn't find any duplicates that match any of the selected fields in the JDBC.
> Hereby my queries:
> Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
> {code}
> Version check query: none
> Access token query: none
> Data query:
>
> {code:java}
> SELECT
> pk1 AS $(IDCOLUMN),
> search_url AS $(URLCOLUMN),
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id,
> search_url AS url,
> ISNULL(title, '') as title,
> ISNULL(groups,'') as groups,
> ISNULL(type,'') as document_type,
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The hereby added csv is the corresponding line from the table.
> [^636_bb2.csv]
>
> Due to this problem, the whole crawling pipeline is being held up. It keeps on retrying this line.
> Could you help me understand this error?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760852#comment-16760852 ]

Donald Van den Driessche commented on CONNECTORS-1579:
------------------------------------------------------

Thanks for clearing this out.
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760848#comment-16760848 ]

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

It's a bug in the code. Whenever the JDBC connector rejects a document based on what the downstream pipeline tells it to do, it improperly accounts for that and you get this error. The fix is quite simple and I can attach a patch, and will do so shortly.

Thanks!
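The patch itself is not reproduced in this thread, so the following is only a hypothetical Java sketch of the kind of accounting error described in the comment above. It assumes that a document rejected by the downstream pipeline is given a disposition immediately but is left in a "pending" map, so a later cleanup pass gives it a second one. The class `RejectionAccountingSketch`, the method `countDispositions`, and the flag `clearPendingOnReject` are invented for illustration and are not taken from `JDBCConnector`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the double-accounting pattern; not ManifoldCF code. */
final class RejectionAccountingSketch {

  /**
   * Returns how many dispositions each document would receive.
   * With clearPendingOnReject == false, a rejected document is counted twice.
   */
  static Map<String, Integer> countDispositions(
      List<String> fetchDocuments,
      Set<String> rejectedByPipeline,
      boolean clearPendingOnReject) {
    Map<String, Integer> dispositions = new HashMap<>();

    // Documents that have not yet been fully handled, keyed by id.
    // The value stands in for a version string (assumed, arbitrary).
    Map<String, String> pending = new HashMap<>();
    for (String id : fetchDocuments) {
      pending.put(id, "v1");
    }

    // Pass 1: every requested document comes back from the data query.
    for (String id : fetchDocuments) {
      dispositions.merge(id, 1, Integer::sum); // indexed, or "no document" if rejected
      if (!rejectedByPipeline.contains(id) || clearPendingOnReject) {
        pending.remove(id);
      }
    }

    // Pass 2: anything still pending is treated as having no data,
    // which records a further disposition for it.
    for (String id : fetchDocuments) {
      if (pending.containsKey(id)) {
        dispositions.merge(id, 1, Integer::sum);
      }
    }
    return dispositions;
  }
}
```

In this model, calling `countDispositions` with document `636` in the rejected set and `clearPendingOnReject` set to `false` yields two dispositions for `636`, which is the situation the framework refuses. Setting the flag to `true` yields exactly one.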
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760818#comment-16760818 ]

Karl Wright commented on CONNECTORS-1579:
-----------------------------------------

Hi,

The proximate cause of the problem is that there are multiple "resolutions" occurring for one document in the JDBC crawl set. When a connector is asked to process a document, it must tell the framework what is to be done with it -- either it gets indexed, or it gets skipped, or it gets deleted. The problem is that the connector is telling the framework TWO things for the same document. The code in question:

{code}
    // Now, go through the original id's, and see which ones are still in the map.  These
    // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
    for (final String documentIdentifier : fetchDocuments) {
      if (!seenDocuments.contains(documentIdentifier)) {
        // Never saw it in the fetch attempt
        activities.deleteDocument(documentIdentifier);
      } else {
        // Saw it in the fetch attempt, and we might have fetched it
        final String documentVersion = map.get(documentIdentifier);
        if (documentVersion != null) {
          // This means we did not see it (or data for it) in the result set.  Delete it!
          activities.noDocument(documentIdentifier,documentVersion);
{code}

It's failing on the last line. The connector thinks there is in fact no document that exists (based on the version query you gave it), BUT based on the results of the other queries, it thinks the document does exist (and was in fact processed).

I will need to look carefully at the queries and at the connector code to figure out exactly how that can happen, and then I can let you know whether it's a bug in the code or a bug in your queries. Stay tuned.
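To illustrate the rule described above, that a connector may report only one disposition per document, here is a minimal Java sketch. It is not ManifoldCF code: the class `DispositionTracker`, its `Disposition` enum, and the `record` and `get` methods are hypothetical names chosen for this example. Only the exception message is copied from the log in this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-document disposition check; not ManifoldCF code. */
final class DispositionTracker {

  /** The three outcomes named in the comment above. */
  enum Disposition { INDEXED, SKIPPED, DELETED }

  private final Map<String, Disposition> recorded = new HashMap<>();

  /**
   * Records the single allowed disposition for a document.
   * A second call for the same identifier throws IllegalStateException.
   */
  void record(String documentIdentifier, Disposition disposition) {
    if (recorded.containsKey(documentIdentifier)) {
      throw new IllegalStateException(
          "Multiple document primary component dispositions not allowed: document '"
              + documentIdentifier + "'");
    }
    recorded.put(documentIdentifier, disposition);
  }

  /** Returns the recorded disposition, or null if none has been recorded yet. */
  Disposition get(String documentIdentifier) {
    return recorded.get(documentIdentifier);
  }
}
```

Under these assumptions, recording `INDEXED` for document `636` and then recording `SKIPPED` for the same identifier triggers the exception, while the first disposition stays in place.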