Re: tika integration exception and other related queries
Hi Gary,

We are doing something similar, but we are not creating an XML doc; instead we leave Tika to extract the content and rely on dynamic fields. We are not storing the text either, though I am not sure whether that will still be the case in future.

What about attachments from Microsoft Office 2007 and later? Is this working for you? We always get a number format exception with them. I also posted this to the community, but so far no response has come.

Thanks,
Naveen

On Thu, Jun 9, 2011 at 6:43 PM, Gary Taylor wrote:
> Naveen,
>
> Not sure our requirement matches yours, but one of the things we index is a
> "comment" item that can have one or more files attached to it. To index the
> whole thing as a single Solr document we create a zipfile containing a file
> with the comment details in it and any additional attached files. This is
> submitted to Solr as a TEXT field in an XML doc, along with other meta-data
> fields from the comment. In our schema the TEXT field is indexed but not
> stored, so when we search and get a match back it doesn't contain all of the
> contents from the attached files etc., only the stored fields in our schema.
>
> Admittedly, the user can therefore get back a "comment" match with no
> indication as to WHERE the match occurred (i.e. was it in the meta-data or
> the contents of the attached files), but at the moment we're only interested
> in getting appropriate matches, not explaining where the match is.
>
> Hope that helps.
>
> Kind regards,
> Gary.
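A note on the number format exception mentioned above: the trace reported in this thread shows Long.parseLong being called from TrieField.createField with the input "2011-01-27T07:18:00Z", which suggests that a date string extracted by Tika is reaching a field whose type is numeric. The sketch below is only a Python stand-in for that behaviour, not Solr code; `parses_as_long` and `parse_utc_timestamp` are hypothetical helper names introduced for illustration.

```python
from datetime import datetime, timezone

# The exact string from the reported NumberFormatException.
value = "2011-01-27T07:18:00Z"


def parses_as_long(text: str) -> bool:
    """Rough Python stand-in for Java's Long.parseLong (hypothetical helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def parse_utc_timestamp(text: str) -> datetime:
    """Parse the UTC timestamp form seen in the error (hypothetical helper)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# The value is a well-formed timestamp but not an integer, so a numeric
# parse fails while a date parse succeeds.
print(parses_as_long(value))
print(parse_utc_timestamp(value).isoformat())
```

If that reading is right, one thing worth checking is which schema field (or dynamic field pattern) receives this metadata value and whether its type is a date type rather than a long.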
Re: tika integration exception and other related queries
Naveen,

Not sure our requirement matches yours, but one of the things we index is a "comment" item that can have one or more files attached to it. To index the whole thing as a single Solr document we create a zipfile containing a file with the comment details in it and any additional attached files. This is submitted to Solr as a TEXT field in an XML doc, along with other meta-data fields from the comment. In our schema the TEXT field is indexed but not stored, so when we search and get a match back it doesn't contain all of the contents from the attached files etc., only the stored fields in our schema.

Admittedly, the user can therefore get back a "comment" match with no indication as to WHERE the match occurred (i.e. was it in the meta-data or the contents of the attached files), but at the moment we're only interested in getting appropriate matches, not explaining where the match is.

Hope that helps.

Kind regards,
Gary.

On 09/06/2011 03:00, Naveen Gupta wrote:
> Hi Gary,
>
> It started working. I did not test Zip files, but for rar files it is
> working fine.
>
> The only thing I wanted to do is index the metadata (text mapped to
> content) without storing the data. I also wanted to filter what comes back
> in the search results, and that started working fine too. I don't want to
> show the extracted content to the end user, since the way the information
> is extracted is not very helpful to the user. We could apply a few
> analyzers and filters to remove the unnecessary tags, but the information
> would still not be of much help. I am looking for your opinion: what did
> you do to filter out the content, or are you showing the extracted content
> to the end user?
>
> Even if we do show the text part to the end user, how can I limit the
> number of characters when querying the search results? Is there any
> feature that achieves this, something like the concept of a snippet?
>
> Thanks,
> Naveen
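The bundling step Gary describes (one zipfile holding a file with the comment details plus any attached files) could be sketched roughly as follows. This is a minimal illustration in Python, not Gary's actual code; the function name `build_comment_bundle`, the entry name `comment.txt`, and the sample attachment are all assumptions, and how the resulting archive is then submitted to Solr is left out.

```python
import io
import zipfile
from typing import Mapping


def build_comment_bundle(comment_text: str, attachments: Mapping[str, bytes]) -> bytes:
    """Pack a comment and its attachments into one in-memory zip (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # "comment.txt" is an assumed entry name for the comment details.
        archive.writestr("comment.txt", comment_text)
        for name, data in attachments.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Made-up sample data, for illustration only.
bundle = build_comment_bundle(
    "Reviewed the attached report.",
    {"report.txt": b"Quarterly figures"},
)

# Reopen the archive to confirm which entries it contains.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    print(archive.namelist())
```

Keeping everything in one archive means a single submission produces a single Solr document, which matches the "whole thing as a single Solr document" goal stated above.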
Re: tika integration exception and other related queries
Hi Gary,

It started working. I did not test Zip files, but for rar files it is working fine.

The only thing I wanted to do is index the metadata (text mapped to content) without storing the data. I also wanted to filter what comes back in the search results, and that started working fine too. I don't want to show the extracted content to the end user, since the way the information is extracted is not very helpful to the user. We could apply a few analyzers and filters to remove the unnecessary tags, but the information would still not be of much help. I am looking for your opinion: what did you do to filter out the content, or are you showing the extracted content to the end user?

Even if we do show the text part to the end user, how can I limit the number of characters when querying the search results? Is there any feature that achieves this, something like the concept of a snippet?

Thanks,
Naveen

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor wrote:
> Naveen,
>
> For indexing Zip files with Tika, take a look at the following thread:
>
> http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html
>
> I got it to work with the 3.1 source and a couple of patches.
>
> Hope this helps.
>
> Regards,
> Gary.
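On the snippet question above: Solr's highlighting component is the usual way to get short fragments back instead of whole field values, with `hl.fragsize` bounding the fragment length in characters. The sketch below only builds such a query URL in Python; the host, the `/select` path, the field name `content`, and the helper name `snippet_query` are assumptions for illustration, not details confirmed in this thread.

```python
from urllib.parse import parse_qs, urlencode


def snippet_query(base_url: str, text: str, field: str = "content", max_chars: int = 150) -> str:
    """Build a Solr select URL that asks for one short highlighted fragment."""
    params = {
        "q": text,
        "hl": "true",                   # switch highlighting on
        "hl.fl": field,                 # field to take fragments from
        "hl.fragsize": str(max_chars),  # approximate fragment size in characters
        "hl.snippets": "1",             # at most one fragment per field
        "fl": "id",                     # return only the id among stored fields
    }
    return base_url + "?" + urlencode(params)


# Assumed local Solr URL and made-up query text.
url = snippet_query("http://localhost:8983/solr/select", "budget report")
print(parse_qs(url.split("?", 1)[1])["hl.fragsize"])
```

One caveat: the highlighter works from stored field values, so this does not combine with a content field that is indexed but not stored.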
Re: tika integration exception and other related queries
Naveen,

For indexing Zip files with Tika, take a look at the following thread:

http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.

On 08/06/2011 04:12, Naveen Gupta wrote:
> Hi, can somebody answer this?
>
> 3. Can somebody give me an idea of how to index a zip file?
>
> 1. While sending docx, we are getting the following error.
tika integration exception and other related queries
Hi, can somebody answer this?

3. Can somebody give me an idea of how to index a zip file?

1. While sending docx, we are getting the following error.

java.lang.NumberFormatException: For input string: "2011-01-27T07:18:00Z"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:412)
    at java.lang.Long.parseLong(Long.java:461)
    at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

Thanks,
Naveen

On Tue, Jun 7, 2011 at 3:33 PM, Naveen Gupta wrote:
> Hi,
>
> We are using the ExtractingRequestHandler and we are getting the following
> error when we give it a Microsoft docx file for indexing.
>
> I think that this has something to do with the date field definition, but I
> am not very sure. What field type should we use?
>
> 2. We are trying to index a jpg (when we search on the name of the jpg, it
> does not come back, even though I am passing one in id).
>
> 3. What about zip files or rar files? Does Tika with Solr handle these?
>
> java.lang.NumberFormatException: For input string: "2011-01-27T07:18:00Z"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Long.parseLong(Long.java:412)
>     at java.lang.Long.parseLong(Long.java:461)
>     at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
>     at org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
>     at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
>     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:13