Let's try to figure out why we can't index streamed data from these
.aspx files.  Can you add enough debugging output to show what the
connector is actually trying to stream to Solr?  To do that, you may
well need to write a class that wraps the input stream handed to Solr
and logs enough information (bytes read, end of stream, close) for us
to make sense of what is going on.

One possibility is that the content length is missing or wrong, and
as a result the transfer just keeps going.
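
For what it's worth, a minimal sketch of the kind of wrapper described
above might look like the following.  The class name
DebugCountingInputStream, the declaredLength parameter, and logging to
System.err are all made up for illustration; the real connector code
would use its own logging and would wrap the stream wherever it is
actually handed to SolrJ.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical debugging wrapper (not part of ManifoldCF): counts the bytes
 * actually read from the wrapped stream and reports the total, together with
 * the length the caller claimed, when the stream hits EOF or is closed.
 */
class DebugCountingInputStream extends FilterInputStream {
    private final long declaredLength; // length the caller claims, or -1 if unknown
    private long bytesRead = 0L;
    private boolean reported = false;

    DebugCountingInputStream(InputStream in, long declaredLength) {
        super(in);
        this.declaredLength = declaredLength;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++;
        } else {
            report("EOF");
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates here, so it is counted too.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        } else if (n < 0) {
            report("EOF");
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        report("close");
        super.close();
    }

    long getBytesRead() {
        return bytesRead;
    }

    // Emit one summary line; later events are ignored so the log stays short.
    private void report(String event) {
        if (reported) {
            return;
        }
        reported = true;
        System.err.println("DebugCountingInputStream " + event
            + ": declared=" + declaredLength + " actual=" + bytesRead
            + (declaredLength >= 0 && declaredLength != bytesRead ? " (MISMATCH)" : ""));
    }
}
```

If the declared and actual numbers disagree for the skipped documents,
that would point at the content length rather than at document size.
Note that bytes passed over with skip() are not counted in this sketch.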

Karl

On Mon, Jan 14, 2013 at 3:23 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> Hi Karl,
>
> I think people may want to index the content of aspx files, so treating them
> specially may not be a good solution.
>
> In our environment, aspx files are used to construct a web site that is used
> internally. In my understanding this is one of the use cases of SharePoint. In
> our case the content of the aspx files is fetched from a List. We can access the
> content of the aspx files from the List. It doesn't have HTML tags etc. in it.
>
> But I am not sure if this is a common usage of aspx and Lists.
>
>
> I was thinking of an option such as "index only metadata" that simply ignores
> the document itself.
>
> By the way, I checked some of the skipped aspx files; their sizes are not too
> big: 101 KB, 139 KB, etc.
>
> I suspect some other factor is triggering this. Also, I am seeing this weird
> warning on the Jetty instance that runs Solr.
>
> WARN:oejh.HttpParser:Full [1771440721,-1,m=5,g=6144,p=6144,c=6144]={2F73
>
> Thanks,
> Ahmet
>
> --- On Mon, 1/14/13, Karl Wright <daddy...@gmail.com> wrote:
>
>> From: Karl Wright <daddy...@gmail.com>
>> Subject: Re: Repeated service interruptions - failure processing document: 
>> null
>> To: dev@manifoldcf.apache.org
>> Date: Monday, January 14, 2013, 6:46 PM
>> Hi Ahmet,
>>
>> We could treat .aspx files specially, so that they are
>> considered to never have any content.  But are there cases where
>> someone might want to index any content that these URLs might return?
>> Specifically, what do .aspx "files" typically contain, when found in a
>> SharePoint hierarchy?
>>
>> Karl
>>
>> On Mon, Jan 14, 2013 at 11:37 AM, Ahmet Arslan <iori...@yahoo.com>
>> wrote:
>> > Hi Karl,
>> >
>> > Now 39 aspx files (out of 130) are indexed. The job didn't get killed.
>> > No exceptions in the log.
>> >
>> > I increased the maximum POST size of solr/jetty, but that number (39)
>> > didn't increase.
>> >
>> > I will check the sizes of the remaining 130 - 39 *.aspx files.
>> >
>> > Actually, I am mapping the extracted content of these aspx files to an
>> > ignored dynamic field (fmap.content=content_ignored). I don't use it. I
>> > am only interested in the metadata of these aspx files. It would be
>> > great if there were a setting to just grab the metadata, similar to Lists.
>> >
>> > Thanks,
>> > Ahmet
>> >
>> > --- On Mon, 1/14/13, Karl Wright <daddy...@gmail.com>
>> wrote:
>> >
>> >> From: Karl Wright <daddy...@gmail.com>
>> >> Subject: Re: Repeated service interruptions -
>> failure processing document: null
>> >> To: dev@manifoldcf.apache.org
>> >> Date: Monday, January 14, 2013, 5:46 PM
>> >> I checked in a fix for this ticket on trunk.  Please let me know if it
>> >> resolves this issue.
>> >>
>> >> Karl
>> >>
>> >> On Mon, Jan 14, 2013 at 10:20 AM, Karl Wright
>> <daddy...@gmail.com>
>> >> wrote:
>> >> > This is because httpclient retries on error three times by
>> >> > default.  This has to be disabled in the Solr connector, or the rest
>> >> > of the logic won't work right.
>> >> >
>> >> > I've opened a ticket (CONNECTORS-610) for this problem too.
>> >> >
>> >> > Karl
>> >> >
>> >> > On Mon, Jan 14, 2013 at 10:13 AM, Ahmet Arslan
>> <iori...@yahoo.com>
>> >> wrote:
>> >> >> Hi Karl,
>> >> >>
>> >> >> Thanks for the quick fix.
>> >> >>
>> >> >> I am still seeing the following error after 'svn up' and 'ant build':
>> >> >>
>> >> >> ERROR 2013-01-14 17:09:41,949 (Worker
>> thread '6') -
>> >> Exception tossed: Repeated service interruptions -
>> failure
>> >> processing document: null
>> >> >>
>> >>
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> >> Repeated service interruptions - failure
>> processing
>> >> document: null
>> >> >>         at
>> >>
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
>> >> >> Caused by:
>> >> org.apache.http.client.ClientProtocolException
>> >> >>         at
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>> >> >>         at
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>> >> >>         at
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >> >>         at
>> >>
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> >> >>         at
>> >>
>> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:790)
>> >> >> Caused by:
>> >>
>> org.apache.http.client.NonRepeatableRequestException:
>> Cannot
>> >> retry request with a non-repeatable request
>> entity.
>> >> The cause lists the reason the original request
>> failed.
>> >> >>         at
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>> >> >>         ...
>> 6 more
>> >> >> Caused by: java.net.SocketException:
>> Broken pipe
>> >> >>         at
>> >> java.net.SocketOutputStream.socketWrite0(Native
>> Method)
>> >> >>         at
>> >>
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>> >> >>         at
>> >>
>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >> >>         at
>> >>
>> org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
>> >> >>         at
>> >>
>> org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
>> >> >>         at
>> >>
>> org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
>> >> >>         at
>> >>
>> org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
>> >> >>         at
>> >>
>> org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
>> >> >>         at
>> >>
>> org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
>> >> >>         at
>> >>
>> org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
>> >> >>         at
>> >>
>> org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
>> >> >>         at
>> >>
>> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
>> >> >>         at
>> >>
>> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>> >> >>         at
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
>> >> >>         ...
>> 8 more
>> >> >>
>> >> >>
>> >> >>
>> >> >> --- On Mon, 1/14/13, Karl Wright <daddy...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> From: Karl Wright <daddy...@gmail.com>
>> >> >>> Subject: Re: Repeated service
>> interruptions -
>> >> failure processing document: null
>> >> >>> To: dev@manifoldcf.apache.org
>> >> >>> Date: Monday, January 14, 2013, 3:30
>> PM
>> >> >>> Hi Ahmet,
>> >> >>>
>> >> >>> The exception that seems to be causing the abort is a socket exception
>> >> >>> coming from a socket write:
>> >> >>>
>> >> >>> > Caused by: java.net.SocketException: Broken pipe
>> >> >>>
>> >> >>> This makes sense in light of the http code returned from Solr, which
>> >> >>> was 413:  http://www.checkupdown.com/status/E413.html .
>> >> >>>
>> >> >>> So there is nothing actually *wrong* with the .aspx documents, but
>> >> >>> they are just way too big, and Solr is rejecting them for that reason.
>> >> >>>
>> >> >>> Clearly, though, the Solr connector should recognize this code as
>> >> >>> meaning "never retry", so instead of killing the job, it should just
>> >> >>> skip the document.  I'll open a ticket for that now.
>> >> >>>
>> >> >>> Karl
>> >> >>>
>> >> >>>
>> >> >>> On Mon, Jan 14, 2013 at 8:22 AM, Ahmet
>> Arslan
>> >> <iori...@yahoo.com>
>> >> >>> wrote:
>> >> >>> > Hello,
>> >> >>> >
>> >> >>> > I am indexing a SharePoint 2010 instance using mcf-trunk (at revision 1432907).
>> >> >>> >
>> >> >>> > There is no problem with a Document library that contains Word, Excel, etc.
>> >> >>> >
>> >> >>> > However, I receive the following errors with a Document library that has *.aspx files in it.
>> >> >>> >
>> >> >>> > Status of Jobs => Error: Repeated service interruptions - failure processing document: null
>> >> >>> >
>> >> >>> >  WARN 2013-01-14
>> 15:00:12,720 (Worker
>> >> thread '13')
>> >> >>> - Service interruption reported for
>> job
>> >> 1358009105156
>> >> >>> connection 'iknow': IO exception
>> during
>> >> indexing: null
>> >> >>> > ERROR 2013-01-14 15:00:12,763
>> (Worker
>> >> thread '13') -
>> >> >>> Exception tossed: Repeated service
>> >> interruptions - failure
>> >> >>> processing document: null
>> >> >>> >
>> >> >>>
>> >>
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> >> >>> Repeated service interruptions -
>> failure
>> >> processing
>> >> >>> document: null
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
>> >> >>> > Caused by:
>> >> >>>
>> org.apache.http.client.ClientProtocolException
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:768)
>> >> >>> > Caused by:
>> >> >>>
>> >>
>> org.apache.http.client.NonRepeatableRequestException:
>> >> Cannot
>> >> >>> retry request with a non-repeatable
>> request
>> >> entity.
>> >> >>> The cause lists the reason the
>> original request
>> >> failed.
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>> >> >>> >
>>    ...
>> >> 6 more
>> >> >>> > Caused by:
>> java.net.SocketException:
>> >> Broken pipe
>> >> >>> >
>>    at
>> >> >>>
>> java.net.SocketOutputStream.socketWrite0(Native
>> >> Method)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
>> >> >>> >
>>    ...
>> >> 8 more
>> >> >>> >
>> >> >>> > Status of Jobs => Error:
>> Unhandled Solr
>> >> exception
>> >> >>> during indexing (0): Server at http://localhost:8983/solr/all 
>> >> >>> returned non ok
>> >> >>> status:413, message:FULL head
>> >> >>> >
>> >> >>> >
>> >>    ERROR 2013-01-14
>> >> >>> 15:10:42,074 (Worker thread '15') -
>> Exception
>> >> tossed:
>> >> >>> Unhandled Solr exception during
>> indexing (0):
>> >> Server at http://localhost:8983/solr/all returned
>> >> non ok
>> >> >>> status:413, message:FULL head
>> >> >>> >
>> >> >>>
>> >>
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> >> >>> Unhandled Solr exception during
>> indexing (0):
>> >> Server at http://localhost:8983/solr/all returned
>> >> non ok
>> >> >>> status:413, message:FULL head
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrException(HttpPoster.java:360)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.output.solr.HttpPoster.indexPost(HttpPoster.java:477)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.output.solr.SolrConnector.addOrReplaceDocument(SolrConnector.java:594)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1559)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >> >>> >
>>    at
>> >> >>>
>> >>
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>> >> >>> >
>> >> >>> > On the solr side I see :
>> >> >>> >
>> >> >>> > INFO: Creating new http client,
>> >> >>>
>> >>
>> config:maxConnections=200&maxConnectionsPerHost=8
>> >> >>> > 2013-01-14
>> >> 15:18:21.775:WARN:oejh.HttpParser:Full
>> >> >>>
>> >>
>> [671412972,-1,m=5,g=6144,p=6144,c=6144]={2F736F6C722F616
>> >> >>> ...long long chars ... 2B656B6970{}
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Ahmet
>> >> >>>
>> >>
>>
>
