I just did a check-in which should fix the NPE.

The other exception is a warning; the crawler should retry the
document when that happens, so I would not get excited unless the job
aborts.
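
To illustrate what "retry the document" means in general terms, here is a minimal sketch of retrying an operation that fails with a transient I/O error such as "Connection reset". This is not ManifoldCF's actual service-interruption mechanism; the class name, method name, and parameters (RetrySketch, withRetry, maxAttempts, delayMillis) are made up for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class RetrySketch {
    // Runs the task, retrying on IOException up to maxAttempts times in total.
    static <T> T withRetry(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                // Treated as transient; any other exception propagates immediately.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last; // every attempt failed with an IOException
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            // Simulate two connection resets before the read succeeds.
            if (++calls[0] < 3) {
                throw new java.net.SocketException("Connection reset");
            }
            return "fetched";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A job would only need to abort if the error persisted past the retry limit, which is why a single warning of this kind is usually harmless.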

Karl


On Tue, Aug 14, 2012 at 5:08 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>
> Hi Karl,
>
> Somehow those scanned PDF files do not throw an exception.
> I tried sending them using curl:
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
>      -F "myfile=@ticaret_sicil_gazetesi.pdf"
>
> No exception in the Solr logs. The file is indexed. But when I do this, a Java
> coffee-cup icon appears in the Dock. I don't know what this is. I will
> investigate further on the Tika/Solr side.
>
> Thanks for your support on this.
>
> Anyway, I still sometimes get:
> "Got an unknown remote exception accessing site - axis fault = 
> Server.userException, detail = java.net.UnknownHostException: null"
>
> I see the following entries in manifoldcf.log:
>
>
>  WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: "$Version=0; http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=&WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==&WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==; $Path=/_vti_bin/Discovery.asmx". Illegal path attribute "/_vti_bin/Discovery.asmx". Path of origin: "/Pages/denemeIkGeneralPage0712-6740.aspx"
>
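
The "Cookie rejected" warning above appears to come from the path check in the older cookie rules (RFC 2109), under which a cookie's Path attribute must be a prefix of the path of the request that set it. A minimal sketch of that rule, assuming a plain prefix test; the class and method names (CookiePathCheck, isPathAcceptable) are hypothetical and this is not the HTTP client's own code:

```java
// Hypothetical illustration of an RFC 2109-style cookie path check.
class CookiePathCheck {
    // Accept the cookie only if its Path attribute is a prefix of the
    // path of the request that produced the Set-Cookie header.
    static boolean isPathAcceptable(String cookiePath, String requestPath) {
        if (cookiePath == null || requestPath == null) {
            return false;
        }
        return requestPath.startsWith(cookiePath);
    }

    public static void main(String[] args) {
        String origin = "/Pages/denemeIkGeneralPage0712-6740.aspx";
        // The path from the log entry is not a prefix of the origin path.
        System.out.println(isPathAcceptable("/_vti_bin/Discovery.asmx", origin)); // false
        // A path covering the origin would pass the check.
        System.out.println(isPathAcceptable("/Pages", origin)); // true
        System.out.println(isPathAcceptable("/", origin));      // true
    }
}
```

Here the page under /Pages sets a cookie scoped to /_vti_bin/Discovery.asmx, so the check fails and the cookie is dropped with a warning.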
>
> FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
> java.lang.NullPointerException
>         at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
>         at java.util.HashMap.get(HashMap.java:300)
>         at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
>         at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
>         at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
>         at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)
>
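
The trace above shows the NullPointerException raised inside a key's hashCode(), reached through HashMap.get. One common way this happens is a hashCode() that dereferences a field which turns out to be null. The sketch below shows that pattern and a null-safe variant; BinKey and its binName field are hypothetical stand-ins, not the real QueueTracker.PriorityKey, and this is not necessarily what the committed fix does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class for illustration; not the real PriorityKey.
final class BinKey {
    private final String binName; // may be null in this sketch

    BinKey(String binName) {
        this.binName = binName;
    }

    @Override
    public int hashCode() {
        // Writing binName.hashCode() here would throw NullPointerException
        // for a null field; Objects.hashCode returns 0 instead.
        return Objects.hashCode(binName);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BinKey && Objects.equals(binName, ((BinKey) o).binName);
    }
}

class NullSafeKeyDemo {
    public static void main(String[] args) {
        Map<BinKey, Double> priorities = new HashMap<>();
        priorities.put(new BinKey(null), 1.0);
        // HashMap.get calls hashCode() on the key, as in the trace above.
        System.out.println(priorities.get(new BinKey(null))); // prints 1.0
    }
}
```

Whether a null value should be tolerated in the key at all, or rejected earlier, is a separate design question.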
> --- On Tue, 8/14/12, Karl Wright <daddy...@gmail.com> wrote:
>
>> From: Karl Wright <daddy...@gmail.com>
>> Subject: Re: SharePoint: Error closing connection to file
>> To: dev@manifoldcf.apache.org
>> Date: Tuesday, August 14, 2012, 9:32 AM
>> I've committed a fix to how the WorkerThread handles service
>> interruptions.  This should eliminate the "unexpected value"
>> exception.  Could you confirm that it does?
>>
>> After that, I believe you will have to look at your Tika setup on Solr
>> to figure out how to avoid having PDFs blow up the pipeline.  You
>> should confirm first that Tika is indeed throwing an exception when a
>> PDF is sent to it, of course, and that Solr is closing the http
>> connection under those conditions.
>>
>> Thanks,
>> Karl
>>
>> On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright <daddy...@gmail.com> wrote:
>> > There are two different issues here.  The first one is that you are
>> > having a connection close on you; not sure the reason why, but could
>> > potentially be caused by a Tika exception in Solr.  The second is that
>> > the refactored WorkerThread code I checked in Sunday might have a bug
>> > in handling exceptions of this kind.
>> >
>> > I'll have a look at these and get back to you shortly.
>> >
>> > Karl
>> >
>> > On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >> If I modify my Path Rules to index only *.doc and *.docx files, I can
>> >> re-index over and over without restarting anything. Everything works fine.
>> >> It seems that there is a problem with files whose text cannot be extracted.
>> >>
>> >> /Documents/*.doc     file    include
>> >> /Documents/*.docx    file    include
>> >>
>> >> --- On Tue, 8/14/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >>
>> >>> From: Ahmet Arslan <iori...@yahoo.com>
>> >>> Subject: Re: SharePoint: Error closing connection to file
>> >>> To: dev@manifoldcf.apache.org
>> >>> Date: Tuesday, August 14, 2012, 5:20 AM
>> >>>
>> >>> Also, after this, when I hit "View Repository Connection Status" I get:
>> >>>
>> >>> Got an unknown remote exception accessing site - axis fault
>> >>> = Server.userException, detail =
>> >>> java.net.UnknownHostException: null
>> >>>
>> >>> After I restart MCF, I get "Connection status: Connection working"
>> >>> on the "View Repository Connection Status" page.
>> >>>
>> >>> --- On Tue, 8/14/12, Ahmet Arslan <iori...@yahoo.com> wrote:
>> >>>
>> >>> > From: Ahmet Arslan <iori...@yahoo.com>
>> >>> > Subject: SharePoint: Error closing connection to file
>> >>> > To: dev@manifoldcf.apache.org
>> >>> > Date: Tuesday, August 14, 2012, 5:18 AM
>> >>> > Hello,
>> >>> >
>> >>> > Using the Solr output connector and the SP2010 repository connector,
>> >>> > I am indexing a document library named Documents. This
>> >>> > library has some scanned PDF documents. The very first crawl
>> >>> > indexes all 91 docs.
>> >>> > When I hit "Re-ingest all associated documents" and start a
>> >>> > second crawl, I get: "Error: Unexpected jobqueue status -
>> >>> > record id 1344907007021, expecting active status, saw 3"
>> >>> >
>> >>> > Here is the stack trace.
>> >>> > When I look at
>> >>> > http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf,
>> >>> > it is an image (scanned) PDF.
>> >>> >
>> >>> > WARN 2012-08-14 05:13:22,068 (Worker thread '39') - SharePoint: Error closing connection to file 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf': Connection reset
>> >>> > java.net.SocketException: Connection reset
>> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:113)
>> >>> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>> >>> >     at java.io.FilterInputStream.close(FilterInputStream.java:155)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>> >>> > DEBUG 2012-08-14 05:13:22,072 (Worker thread '42') - SharePoint: Path attribute name is null
>> >>> >  WARN 2012-08-14 05:13:22,081 (Worker thread '39') - SharePoint: IOException thrown: Connection reset
>> >>> > java.net.SocketException: Connection reset
>> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:116)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:90)
>> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
>> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>> >>> >  WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service interruption reported for job 1344906886879 connection 'SP2010': SharePoint is down attempting to read 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf', retrying: Connection reset
>> >>> > ERROR 2012-08-14 05:13:22,230 (Worker thread '39') - Exception tossed: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1344907007021, expecting active status, saw 3
>> >>> >     at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
>> >>> >     at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
>> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
>> >>> >
>> >>>
>>
