Re: delta-import of rich documents like word and pdf files!
Hi guys, I think I found a way to mimic delta-import for the FileListEntityProcessor (I have used it for XML files). Add this configuration in the data-config XML:

<entity name="personeImpreseList" rootEntity="false" dataSource="null"
        processor="FileListEntityProcessor" fileName="^.*\.xml$"
        recursive="false" baseDir="/data/listPersoneImprese"
        newerThan="${dataimporter.last_index_time}"/>

and use the command:

command=full-import&clean=false

Solr then adds to the index only the files that changed since the last indexing session. This was probably an obvious approach, but I would like to know your opinion about it. Cheers

2011/11/12 neuron005 neuron...@gmail.com

I want to perform a delta-import of my rich documents, like PDF and Word files. I added pk=something in my data-config.xml file, but now I don't know my next step. How will delta-import know which fields were updated? I am not connecting to a database. Is there any query like the database deltaImportQuery and deltaQuery? Does anyone have a solution? Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3502039.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
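For anyone scripting this, here is a minimal sketch of how such a request URL could be assembled; the host, port, and handler path are placeholder assumptions, not values from this thread:

```python
from urllib.parse import urlencode

# Hypothetical Solr DIH endpoint; adjust host/port/handler for your setup.
BASE = "http://localhost:8983/solr/dataimport"

def build_full_import_url(extra_params=None):
    """Build a DIH full-import URL that keeps existing documents
    (clean=false), which mimics a delta import when the entity also
    uses newerThan to skip unchanged files."""
    params = {"command": "full-import", "clean": "false"}
    if extra_params:
        params.update(extra_params)
    return BASE + "?" + urlencode(params)

url = build_full_import_url({"myLastModifiedParam": "NOW-3DAYS"})
```

The extra parameter name here is only illustrative; any request parameter can be referenced from data-config.xml via ${dataimporter.request.paramName}.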
Re: delta-import of rich documents like word and pdf files!
I am using Solr 3.4 and configured my DataImportHandler to pull some data from MySQL as well as index some rich documents from disk. This is the part of my db-data-config file where I index the rich-text documents:

<entity name="resume" dataSource="ds-db"
        query="select name, js_login_id div 25000 as dir from js_resumes where js_login_id='${js_logins.id}' and is_primary=1 and deleted=0 and mask_cv != 1"
        pk="resume_name"
        deltaQuery="select js_login_id from js_resumes where modified > '${dataimporter.last_index_time}' and is_primary=1 and deleted=0"
        parentDeltaQuery="select jsl.id as id from service_request_histories srh, service_requests sr, js_login_screenings jsls, js_logins jsl where jsl.status in (1,2) and srh.service_request_id = sr.id and jsl.id = jsls.js_login_id and srh.status in ('8','43') and jsls.id = srh.sid and date(srh.created) > date_sub(now(), interval 2 day) and jsl.id = '${js_resumes.js_login_id}'">
  <entity processor="TikaEntityProcessor" tikaConfig="tika-config.xml"
          url="http://localhost/resumes-new/resumes${resume.dir}/${js_logins.id}/${resume.name}"
          dataSource="ds-file" format="text">
    <field column="text" name="resume"/>
  </entity>
</entity>

But after some time I get the following error in my error log. It looks like a missing-class error. Can anyone tell me which POI jar version works with Tika 0.6? Currently I have poi-3.7.jar.
The error I am getting is this:

SEVERE: Exception while processing: js_logins document : SolrInputDocument[{id=id(1.0)={100984}, complete_mobile_number=complete_mobile_number(1.0)={+91 9600067575}, emailid=emailid(1.0)={vkry...@gmail.com}, full_name=full_name(1.0)={Venkat Ryali}}]:
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.<init>(XWPFWordExtractorDecorator.java:163)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.<init>(XWPFWordExtractorDecorator.java:161)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTableContent(XWPFWordExtractorDecorator.java:140)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:91)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:69)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    ... 7 more
Re: delta-import of rich documents like word and pdf files!
When I set my fileSize field to type string, it shows the error I posted above. Then I changed it to slong, and the result was severe. Here is the log:

18 Nov, 2011 3:00:54 PM org.apache.solr.response.BinaryResponseWriter$Resolver getDoc
WARNING: Error reading a field from document : SolrDocument[{}]
java.lang.StringIndexOutOfBoundsException: String index out of range: 3
    at java.lang.String.charAt(String.java:694)
    at org.apache.solr.util.NumberUtils.SortableStr2long(NumberUtils.java:152)
    at org.apache.solr.schema.SortableLongField.toObject(SortableLongField.java:70)
    at org.apache.solr.schema.SortableLongField.toObject(SortableLongField.java:37)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.getDoc(BinaryResponseWriter.java:148)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.writeDocList(BinaryResponseWriter.java:122)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:86)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:144)
    at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:134)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:222)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:139)
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:87)
    at org.apache.solr.response.BinaryResponseWriter.getParsedResponse(BinaryResponseWriter.java:191)
    at org.apache.solr.response.VelocityResponseWriter.write(VelocityResponseWriter.java:57)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:343)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:929)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:964)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:304)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

What does this mean?
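The SortableStr2long call in that trace is trying to decode a plain string as Solr's sortable-long encoding, which fails. A small plain-Python illustration (not Solr code, sample values made up) of why raw strings can't stand in for numbers when sorting or bucketing:

```python
# File sizes as plain strings vs. integers: lexicographic order is
# wrong for numbers, which is why Solr needs a numeric (or specially
# encoded sortable) field type for fileSize, not a plain string.
sizes = ["99", "100", "2500", "3"]

as_strings = sorted(sizes)            # compares character by character
as_numbers = sorted(sizes, key=int)   # compares numeric values
```

Lexicographically "100" sorts before "99", so a string-typed fileSize can never support correct numeric sorting or range faceting.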
Re: delta-import of rich documents like word and pdf files!
Thank you for your replies, guys; that helped a lot. Thanks iorixxx, that was the command that worked. I also tried my Solr with MySQL and that worked too. Congrats! :)

Now I want to index my files according to their size and facet them by size ranges. I know there is a fileSize option in FileListEntityProcessor, but I can't find a way to make this work. Is fileSize a metadata field? If it is, then these are the steps I performed. I created a field and a dynamic field in schema.xml:

<dynamicField name="metadata_*" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>

added a range facet in solrconfig.xml:

<int name="f.fileSize.facet.range.start">0</int>
<int name="f.fileSize.facet.range.gap">100</int>
<int name="f.fileSize.facet.range.end">600</int>

and added the corresponding field in data-config.xml:

<field column="FileSize" name="fileSize"/>

But that did not work! Am I missing something? Please help me out. Thanks in advance.
Re: delta-import of rich documents like word and pdf files!
Now, I want to index my files according to their size and facet them according to their size ranges. I know that there is an option of fileSize in FileListEntityProcessor but I am not getting any way to perform this. Is fileSize a metadata?

You don't need a dynamic field for this. The following additions should enable and populate fileSize.

In data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ...>
  <field column="fileSize" name="fileSize"/>
</entity>

In schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>
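As a rough illustration of what the facet.range.start/gap/end parameters compute, here is a plain-Python sketch of the bucketing (this is an analogy, not Solr's implementation, and the sample sizes are made up):

```python
def range_facet(values, start, gap, end):
    """Count values into buckets [start, start+gap), [start+gap, start+2*gap),
    ... up to end, the way Solr's facet.range parameters partition a field."""
    counts = {}
    lo = start
    while lo < end:
        hi = lo + gap
        counts[lo] = sum(1 for v in values if lo <= v < hi)
        lo = hi
    return counts

file_sizes = [42, 150, 180, 420, 599]   # hypothetical fileSize values
buckets = range_facet(file_sizes, start=0, gap=100, end=600)
```

Note that this kind of arithmetic bucketing only works when the field values are numeric, which is why the string-typed fileSize attempts later in the thread fail.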
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply. I performed these steps.

In data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ...>
  <field column="fileSize" name="fileSize"/>
</entity>

In schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>

But there is still no response in the browse section. I edited facet_ranges.vm for this, but it does not show the size of the documents. Can you please tell me the command to check whether the response includes the file size? Thanks again.
Re: delta-import of rich documents like word and pdf files!
Also, I set my fileSize to type long. I don't think string will work; a size cannot be a string, and it shows an error when I use string as the type.
Re: delta-import of rich documents like word and pdf files!
I ran this command and can see the size of my files:

http://localhost:8080/solr/select?q=user&f.fileSize.facet.range.start=100

Great, thanks... string worked. I don't know why that did not work last time. But when I do that in the browse section, I see the following output in my logs:

SEVERE: Exception during facet.range of fileSize:org.apache.solr.common.SolrException: Unable to range facet on field:fileSize{type=string,properties=indexed,stored,omitNorms,omitTermFreqAndPositions,sortMissingLast}
    at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:834)
    at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:778)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:178)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servl..

This does not happen when I set the type to int, but with int it does not show the size! Please help me out.
Re: delta-import of rich documents like word and pdf files!
Sorry for disturbing you all... actually I had to use plong instead of type string. My problem is solved. Be ready for a new thread. CHEERS
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply, Mr. Erick. All I want is this: I have indexed some of my PDF and DOC files. Now, whenever I change them, I want a delta-import (incremental) so that I do not have to re-index every document with a full import; only the changes made to these documents should get updated. I am using DataImportHandler. I have looked in the forums, but all the delta-import questions relate to databases, and I am just indexing some DOC and PDF files for now. What should I do to achieve that?

Can you provide your data-config.xml?
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply... my data-config.xml is:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" pk="id" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false" dataSource="null"
            baseDir="/var/data/solr"
            fileName=".*\.(DOC|PDF|XML|xml|JPEG|jpg|ZIP|zip|pdf|doc)$"
            onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
        <field column="id" name="id"/>
      </entity>
      <field column="file" name="fileName"/>
      <field column="fileAbsolutePath" name="links"/>
    </entity>
  </document>
</dataConfig>
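A side note on the fileName regex: as originally posted (.*\.(DOC)|(PDF)|...), the alternation applies to the whole pattern, so a branch like (PDF) would match "PDF" anywhere in a name rather than only as an extension. A quick check of the grouped form with Python's re module (not Solr itself; the sample names are made up):

```python
import re

# Grouped form: the alternation is confined to the extension.
pattern = re.compile(r".*\.(DOC|PDF|XML|xml|JPEG|jpg|ZIP|zip|pdf|doc)$")

names = ["resume.pdf", "notes.DOC", "image.jpg", "archive.tar.gz", "PDF-guide.txt"]
matched = [n for n in names if pattern.match(n)]
```

With the grouping fixed, only files whose extension is in the list are accepted.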
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply... my data-config.xml is:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" pk="id" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false" dataSource="null"
            baseDir="/var/data/solr"
            fileName=".*\.(DOC|PDF|XML|xml|JPEG|jpg|ZIP|zip|pdf|doc)$"
            onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
        <field column="id" name="id"/>
      </entity>
      <field column="file" name="fileName"/>
      <field column="fileAbsolutePath" name="links"/>
    </entity>
  </document>
</dataConfig>

According to the wiki, the only EntityProcessor that supports delta is SqlEntityProcessor. Maybe you can use the newerThan parameter of FileListEntityProcessor; issuing a full-import with clean=false may mimic a delta import. You can pass the value of this newerThan parameter in your request:

command=full-import&clean=false&myLastModifiedParam=NOW-3DAYS

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
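What newerThan filters on can be sketched in a few lines. This is a plain-Python analogy of the FileListEntityProcessor behavior (not its actual code; the directory layout is hypothetical):

```python
import os

def files_newer_than(base_dir, threshold):
    """List files under base_dir whose modification time is after
    `threshold` (seconds since the epoch) -- roughly the check the
    newerThan attribute applies to each candidate file."""
    result = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > threshold:
                result.append(path)
    return result
```

Combined with clean=false, only files modified after the last index time get re-indexed, which is the delta-mimicking trick described above.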
Re: delta-import of rich documents like word and pdf files!
And you cannot update in place. That is, you can't update just selected fields in a document; you have to re-index the whole document.

Best
Erick

On Mon, Nov 14, 2011 at 6:11 AM, Ahmet Arslan iori...@yahoo.com wrote:

Thanks for your reply... my data-config.xml is:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" pk="id" processor="FileListEntityProcessor"
            recursive="true" rootEntity="false" dataSource="null"
            baseDir="/var/data/solr"
            fileName=".*\.(DOC|PDF|XML|xml|JPEG|jpg|ZIP|zip|pdf|doc)$"
            onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
        <field column="id" name="id"/>
      </entity>
      <field column="file" name="fileName"/>
      <field column="fileAbsolutePath" name="links"/>
    </entity>
  </document>
</dataConfig>

According to the wiki, the only EntityProcessor that supports delta is SqlEntityProcessor. Maybe you can use the newerThan parameter of FileListEntityProcessor; issuing a full-import with clean=false may mimic a delta import. You can pass the value of this newerThan parameter in your request:

command=full-import&clean=false&myLastModifiedParam=NOW-3DAYS

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply, Mr. Erick. All I want is this: I have indexed some of my PDF and DOC files. Now, whenever I change them, I want a delta-import (incremental) so that I do not have to re-index every document with a full import; only the changes made to these documents should get updated. I am using DataImportHandler. I have looked in the forums, but all the delta-import questions relate to databases, and I am just indexing some DOC and PDF files for now. What should I do to achieve that? Thanks in advance
Re: delta-import of rich documents like word and pdf files!
And the changes are: file content, and maybe the author and headers.
Re: delta-import of rich documents like word and pdf files!
Can you give more details about what you're trying to do? It looks like you're using DataImportHandler? What defines a document needing to be re-indexed? How do you expect to be able to identify them?

Perhaps you can review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Nov 12, 2011 at 4:20 AM, neuron005 neuron...@gmail.com wrote:

I want to perform a delta-import of my rich documents, like PDF and Word files. I added pk=something in my data-config.xml file, but now I don't know my next step. How will delta-import know which fields were updated? I am not connecting to a database. Is there any query like the database deltaImportQuery and deltaQuery? Does anyone have a solution? Thanks in advance