I just did a check-in which should fix the NPE. The other exception is a warning; the crawler should retry the document when that happens, so I would not get excited unless the job aborts.
Karl

On Tue, Aug 14, 2012 at 5:08 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>
> Hi Karl,
>
> Somehow those scanned pdf files do not throw exception.
> I tried sending them using curl :
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"
>
> No exception in solr logs. File is indexed. But when i do this, java coffee
> icon appears in Dock. I don't know what this is. I will further investigate
> on tika/solr side.
>
> Thanks for your support on this.
>
> Anyways, I still sometimes get :
> "Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null"
>
> I see following entries in manifoldcf.log
>
> WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: "$Version=0; http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=&WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==&WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==; $Path=/_vti_bin/Discovery.asmx". Illegal path attribute "/_vti_bin/Discovery.asmx". Path of origin: "/Pages/denemeIkGeneralPage0712-6740.aspx"
>
> FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
> java.lang.NullPointerException
>     at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
>     at java.util.HashMap.get(HashMap.java:300)
>     at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
>     at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
>     at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
>     at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)
>
> --- On Tue, 8/14/12, Karl Wright <daddy...@gmail.com> wrote:
>
>> From: Karl Wright <daddy...@gmail.com>
>> Subject: Re: SharePoint: Error closing connection to file
>> To: dev@manifoldcf.apache.org
>> Date: Tuesday, August 14, 2012, 9:32 AM
>> I've committed a fix to how the WorkerThread handles service
>> interruptions. This should eliminate the "unexpected value"
>> exception. Could you confirm that it does?
>>
>> After that, I believe you will have to look at your Tika setup on Solr
>> to figure out how to avoid having PDFs blow up the pipeline. You
>> should confirm first that Tika is indeed throwing an exception when a
>> PDF is sent to it, of course, and that Solr is closing the http
>> connection under those conditions.
>>
>> Thanks,
>> Karl
>>
>> On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright <daddy...@gmail.com> wrote:
>> > There are two different issues here. The first one is that you are
>> > having a connection close on you; not sure the reason why, but could
>> > potentially be caused by a Tika exception in Solr. The second is that
>> > the refactored WorkerThread code I checked in Sunday might have a bug
>> > in handling exceptions of this kind.
>> >
>> > I'll have a look at these and get back to you shortly.
>> >
>> > Karl
>> >
>> > On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >> If I modify my Path Rules to index only *.doc and *.docx files, I can re-index over and over without restarting anything. Everything works fine.
>> >> It seems that there is a problem with non text extractable files.
>> >>
>> >> /Documents/*.doc file include
>> >> /Documents/*.docx file include
>> >>
>> >> --- On Tue, 8/14/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >>
>> >>> From: Ahmet Arslan <iori...@yahoo.com>
>> >>> Subject: Re: SharePoint: Error closing connection to file
>> >>> To: dev@manifoldcf.apache.org
>> >>> Date: Tuesday, August 14, 2012, 5:20 AM
>> >>>
>> >>> Also after this, when i hit "View Repository Connection
>> >>> Status" i get :
>> >>>
>> >>> Got an unknown remote exception accessing site - axis fault
>> >>> = Server.userException, detail =
>> >>> java.net.UnknownHostException: null
>> >>>
>> >>> I restart mcf, I get "Connection status: Connection working"
>> >>> at "View Repository Connection Status" page.
>> >>>
>> >>> --- On Tue, 8/14/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >>>
>> >>> > From: Ahmet Arslan <iori...@yahoo.com>
>> >>> > Subject: SharePoint: Error closing connection to file
>> >>> > To: dev@manifoldcf.apache.org
>> >>> > Date: Tuesday, August 14, 2012, 5:18 AM
>> >>> > Hello,
>> >>> >
>> >>> > Using solr output connector and SP2010 Repository connector,
>> >>> > I am indexing a document library named Documents. This
>> >>> > library has some scanned pdf documents. Very First crawl
>> >>> > indexes all 91 docs.
>> >>> > When I hit "Re-ingest all associated documents" and start
>> >>> > second crawl, I get : "Error: Unexpected jobqueue status -
>> >>> > record id 1344907007021, expecting active status, saw 3"
>> >>> >
>> >>> > Here is the stack trace:
>> >>> > When i look at
>> >>> > http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf,
>> >>> > it is an image (scanned) pdf.
>> >>> >
>> >>> > WARN 2012-08-14 05:13:22,068 (Worker thread '39') - SharePoint: Error closing connection to file 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf': Connection reset
>> >>> > java.net.SocketException: Connection reset
>> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:113)
>> >>> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>> >>> >     at java.io.FilterInputStream.close(FilterInputStream.java:155)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>> >>> > DEBUG 2012-08-14 05:13:22,072 (Worker thread '42') - SharePoint: Path attribute name is null
>> >>> > WARN 2012-08-14 05:13:22,081 (Worker thread '39') - SharePoint: IOException thrown: Connection reset
>> >>> > java.net.SocketException: Connection reset
>> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:116)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:90)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>> >>> > WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service interruption reported for job 1344906886879 connection 'SP2010': SharePoint is down attempting to read 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf', retrying: Connection reset
>> >>> > ERROR 2012-08-14 05:13:22,230 (Worker thread '39') - Exception tossed: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
>> >>> >     at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
>> >>> >     at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
>> >>> >