Re: Need help handling corrupted files

Julien Nioche Fri, 05 Aug 2011 04:51:13 -0700

Simply change your Solr schema and make the title field multiValued.
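
As a rough sketch, assuming your schema.xml is based on the one shipped with Nutch, the change would look something like the line below. The type, stored and indexed values are assumptions on my side; keep whatever your schema already declares for title and only add the multiValued attribute.

```xml
<!-- schema.xml: let the title field accept more than one value per document -->
<!-- type/stored/indexed are assumed here; keep your existing values -->
<field name="title" type="text" stored="true" indexed="true" multiValued="true"/>
```

Solr needs to be restarted (or the core reloaded) before the schema change takes effect.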

On 5 August 2011 12:38, Marek Bachmann <[email protected]> wrote:


> Hey ho,
>
> I have a problem with a URL that seems to point to a vCard (.vcf) document.
> Let me explain:
>
> When I try to build a Solr index, this URL causes the following error
> message:
>
> SEVERE: org.apache.solr.common.SolrException: ERROR:
> [http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10]
> multiple values encountered for non multiValued field title: [Universität
> Kassel, Fachbereich 6 ASL: Faculty Members, Lolita_Hörnlein.vcf]
>        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
>        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
>        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
>        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>        at org.mortbay.jetty.Server.handle(Server.java:326)
>        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
>
> The URL is:
>
> http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10
>
> When I download it separately, it returns the following response:
>
> Status=OK - 200
> Date=Fri, 05 Aug 2011 11:09:12 GMT
> Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
> X-Powered-By=PHP/5.2.0-8+etch16
> Content-Disposition=attachment; filename=Lolita_Hörnlein.vcf
> Pragma=public
> Content-Type=text/directory
> Set-Cookie=fe_typo_user=316c4c91100f95fb57c5e8d39d32f99d; path=/asl/
> Via=1.1 cms.uni-kassel.de
> Vary=Accept-Encoding
> Content-Encoding=gzip
> Content-Length=5043
> Keep-Alive=timeout=15, max=99
> Connection=Keep-Alive
>
> I have inspected this file and found that it is corrupted: besides the
> proper vCard data, it also contains generated HTML code. This seems to be
> a misbehaviour of some plugin in the CMS.
>
> My question is how to handle such files. It looks like the parser puts too
> many values into the title field, so Solr can't handle it.
>
> As a quick solution, it would be best if I could configure Tika so that
> it doesn't parse the vcf at all. But I don't know how to do that.
>
> Any suggestions for this problem?
>
> Thank you very much.
>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
