Re: delta-import of rich documents like word and pdf files!

2011-12-22 Thread Alessandro Benedetti
Hi Guys,
I think I found a way to mimic delta import with the FileListEntityProcessor
(I have used it for XML files).
Add this configuration to the XML data-config:

<entity name="personeImpreseList" rootEntity="false" dataSource="null"
        processor="FileListEntityProcessor"
        fileName="^.*\.xml$" recursive="false"
        baseDir="/data/listPersoneImprese"
        newerThan="'${dataimporter.last_index_time}'">


Then use the command:
command=full-import&clean=false

Solr then adds to the index only the files that have changed since the last
indexing session.

This is probably an obvious approach, but I would like to hear your opinion
about it.
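
For rich documents, the same idea might look roughly like the sketch below. It is only an illustration under assumptions: the baseDir, the fileName pattern, the entity names and the use of the file path as the document id are not from my setup, and the schema is assumed to have "id" and "text" fields.

```xml
<!-- Sketch of a data-config.xml; paths, patterns and names are illustrative assumptions. -->
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: lists only files modified since the last import. -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/docs" fileName=".*\.(pdf|doc)" recursive="true"
            rootEntity="false" dataSource="null"
            newerThan="'${dataimporter.last_index_time}'">
      <!-- Inner entity: extracts text from each listed file with Tika. -->
      <entity name="doc" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="text"/>
      </entity>
      <!-- A stable unique key lets a re-indexed file overwrite its older copy. -->
      <field column="fileAbsolutePath" name="id"/>
    </entity>
  </document>
</dataConfig>
```

Because clean=false keeps the existing index, a changed file only replaces its previous version if the unique key stays the same between runs. Files deleted from disk are not removed from the index by this approach.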

Cheers

2011/11/12 neuron005 neuron...@gmail.com

> I want to perform a delta import of my rich documents, such as PDF and Word
> files.
> I added pk=something to my data-config.xml file,
> but now I don't know the next step. How will delta-import know which
> fields were updated? I am not connected to a database. Is there any
> query like the database queries deltaImportQuery and deltaQuery?
> Does anyone have a solution?
> Thanks in advance
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3502039.html
> Sent from the Solr - User mailing list archive at Nabble.com.






Re: delta-import of rich documents like word and pdf files!

2011-11-20 Thread kumar8anuj
I am using Solr 3.4 and have configured my DataImportHandler to get some data
from MySQL as well as to index some rich documents from disk.

This is the part of the db-data-config file where I index the rich-text
documents:

<entity name="resume" dataSource="ds-db"
        query="Select name,js_login_id div 25000 as dir from js_resumes where
               js_login_id='${js_logins.id}' and is_primary = 1 and deleted=0 and mask_cv != 1"
        pk="resume_name"
        deltaQuery="select js_login_id from js_resumes where
                    modified > '${dataimporter.last_index_time}' and is_primary = 1 and deleted=0"
        parentDeltaQuery="select  jsl.id as id  from
                    service_request_histories srh,service_requests sr, js_login_screenings jsls,
                    js_logins jsl where jsl.status IN(1,2) and srh.service_request_id = sr.id
                    and jsl.id=jsls.js_login_id and srh.status in ('8','43') and jsls.id=srh.sid
                    and date(srh.created)  date_sub(now(),interval 2 day) and jsl.id =
                    '${js_resumes.js_login_id}'">
  <entity processor="TikaEntityProcessor"
          tikaConfig="tika-config.xml"
          url="http://localhost/resumes-new/resumes${resume.dir}/${js_logins.id}/${resume.name}"
          dataSource="ds-file" format="text">
    <field column="text" name="resume" />
  </entity>
</entity>

After some time I get the following error in my error log. It looks like
a missing-class error. Can anyone tell me which POI jar version works
with Tika 0.6? Currently I have poi-3.7.jar.

The error I am getting is this:

SEVERE: Exception while processing: js_logins document :
SolrInputDocument[{id=id(1.0)={100984},
complete_mobile_number=complete_mobile_number(1.0)={+91 9600067575},
emailid=emailid(1.0)={vkry...@gmail.com}, full_name=full_name(1.0)={Venkat
Ryali}}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodError: org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.<init>(XWPFWordExtractorDecorator.java:163)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.<init>(XWPFWordExtractorDecorator.java:161)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTableContent(XWPFWordExtractorDecorator.java:140)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:91)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:69)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    ... 7 more



Re: delta-import of rich documents like word and pdf files!

2011-11-18 Thread neuron005
When I set my fileSize field to type string, it shows the error I posted earlier.
Then I changed it to slong and the result was severe. Here is the log:

18 Nov, 2011 3:00:54 PM org.apache.solr.response.BinaryResponseWriter$Resolver getDoc
WARNING: Error reading a field from document : SolrDocument[{}]
java.lang.StringIndexOutOfBoundsException: String index out of range: 3
    at java.lang.String.charAt(String.java:694)
    at org.apache.solr.util.NumberUtils.SortableStr2long(NumberUtils.java:152)
    at org.apache.solr.schema.SortableLongField.toObject(SortableLongField.java:70)
    at org.apache.solr.schema.SortableLongField.toObject(SortableLongField.java:37)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.getDoc(BinaryResponseWriter.java:148)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.writeDocList(BinaryResponseWriter.java:122)
    at org.apache.solr.response.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:86)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:144)
    at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:134)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:222)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:139)
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:87)
    at org.apache.solr.response.BinaryResponseWriter.getParsedResponse(BinaryResponseWriter.java:191)
    at org.apache.solr.response.VelocityResponseWriter.write(VelocityResponseWriter.java:57)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:343)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:929)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:964)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:304)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

What does this mean?




Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Thank you for your replies, guys. That helped a lot. Thanks,
iorixxx, that was the command that worked.
I also tried Solr with MySQL and that worked too. Congo! :)
Now I want to index my files according to their size and facet them
by size range. I know that FileListEntityProcessor has a fileSize
option, but I cannot find a way to use it.
Is fileSize metadata? If it is, these are the steps I performed:

I created a field and a dynamic field in the schema:

<dynamicField name="metadata_*" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>

I added a range facet in solrconfig.xml and the corresponding field in
data-config.xml:

-- solrconfig.xml --
<int name="f.fileSize.facet.range.start">0</int>
<int name="f.fileSize.facet.range.gap">100</int>
<int name="f.fileSize.facet.range.end">600</int>

-- data-config.xml --
<field column="FileSize" name="fileSize" />

But that did not work!
Am I missing something?
Please help me out.
Thanks in advance.



Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread Ahmet Arslan
> Now, I want to index my files according to their size and facet them
> according to their size ranges. I know that there is an option of fileSize
> in FileListEntityProcessor but I am not getting any way to perform this.
> Is fileSize a metadata?

You don't need a dynamic field for this. The following additions should enable
and populate fileSize:

In data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ...>
  <field column="fileSize" name="fileSize"/>
</entity>

In schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>
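
For context, fileSize is one of the implicit columns that FileListEntityProcessor emits for every file it lists, alongside fileDir, file, fileAbsolutePath and fileLastModified. A rough sketch of mapping several of them at once follows; the fileName pattern and the "lastModified" schema field are assumptions for illustration, while the other names follow the config posted in this thread.

```xml
<!-- Sketch: maps implicit FileListEntityProcessor columns to schema fields. -->
<entity name="f" processor="FileListEntityProcessor"
        baseDir="/var/data/solr" fileName=".*\.pdf"
        rootEntity="false" dataSource="null">
  <field column="file" name="fileName"/>
  <field column="fileAbsolutePath" name="links"/>
  <!-- Size of the file in bytes. -->
  <field column="fileSize" name="fileSize"/>
  <!-- "lastModified" is a hypothetical schema field. -->
  <field column="fileLastModified" name="lastModified"/>
  <!-- The nested TikaEntityProcessor entity would go here. -->
</entity>
```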





Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Thanks for your reply. I performed these steps.

In data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ...>
  <field column="fileSize" name="fileSize"/>
</entity>

In schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>

But there is still no response in the browse section. I edited facet_ranges.vm
for this. It does not calculate the size of the documents. Can you please tell
me the command to check that the response shows the file size?
Thanks again




Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
I also set my fileSize field to type long. I don't think string will work:
a size cannot be a string, and it shows an error when I use string as the type.



Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
I ran this command and can see the size of my files:
http://localhost:8080/solr/select?q=user&f.fileSize.facet.range.start=100
Great, thanks. String worked; I don't know why it did not work last time.

But when I do that in the browse section, I see the following output in my logs:
SEVERE: Exception during facet.range of fileSize:org.apache.solr.common.SolrException: Unable to range facet on field:fileSize{type=string,properties=indexed,stored,omitNorms,omitTermFreqAndPositions,sortMissingLast}
    at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:834)
    at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:778)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:178)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servl..

This does not happen when I set the type to int, but with int the size is not
shown!
Please help me out.



Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Sorry for disturbing you all. Actually, I had to use plong instead of string
as the type.
My problem is solved.
Be ready for a new thread.
CHEERS
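
For anyone hitting the same "Unable to range facet" error: as far as I know, range faceting in Solr 3.x only accepts numeric Trie fields and date fields, which would explain why a string field fails. The fragments below are a sketch under assumptions, not my exact files: the tlong type is written the way the stock example schema defines it, and the /browse handler name is assumed. I used a type called plong, so check how your own schema.xml defines whichever long type you pick.

```xml
<!-- schema.xml (sketch): a Trie-based long type and the fileSize field using it. -->
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="fileSize" type="tlong" indexed="true" stored="true" required="false"/>

<!-- solrconfig.xml (sketch): range facet defaults on the assumed /browse handler. -->
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">on</str>
    <str name="facet.range">fileSize</str>
    <int name="f.fileSize.facet.range.start">0</int>
    <int name="f.fileSize.facet.range.gap">100</int>
    <int name="f.fileSize.facet.range.end">600</int>
  </lst>
</requestHandler>
```

After changing a field's type, the documents need to be re-indexed, since values already indexed under the old type are not converted.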



Re: delta-import of rich documents like word and pdf files!

2011-11-14 Thread Ahmet Arslan
> Thanks for your reply Mr. Erick
> All I want to do is that I have indexed some of my pdf files and doc files.
> Now, any changes I make to them, I want a delta-import(incremental) so that
> I do not have to re index whole document by full import . Only changes made
> to these documents should get updated. I am using dataimporthandler. I
> have seen in forums but all of them have queried for delta import related to
> databases. I am just indexing some of my doc and pdf files for now.
> What should I do in order to achieve that?

Can you provide your data-config.xml? 


Re: delta-import of rich documents like word and pdf files!

2011-11-14 Thread neuron005
Thanks for your reply. My data-config.xml is:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" pk="id" processor="FileListEntityProcessor"
            recursive="true"
            rootEntity="false"
            dataSource="null" baseDir="/var/data/solr"
            fileName=".*\.(DOC)|(PDF)|(XML)|(xml)|(JPEG)|(jpg)|(ZIP)|(zip)|(pdf)|(doc)"
            onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
        <field column="id" name="id"/>
      </entity>
      <field column="file" name="fileName"/>
      <field column="fileAbsolutePath" name="links"/>
    </entity>
  </document>
</dataConfig>



Re: delta-import of rich documents like word and pdf files!

2011-11-14 Thread Ahmet Arslan


According to the wiki, the only EntityProcessor that supports delta imports is
SqlEntityProcessor.

Maybe you can use the newerThan parameter of FileListEntityProcessor. Issuing a
full-import with clean=false may mimic a delta import.

You can pass the value of the newerThan parameter in your request:

command=full-import&clean=false&myLastModifiedParam=NOW-3DAYS

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
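
A minimal sketch of how the entity might pick up that request parameter, assuming the parameter name myLastModifiedParam from the command above and an illustrative fileName pattern; I have not tested this exact config:

```xml
<!-- Sketch: newerThan is resolved from the request parameter at import time. -->
<entity name="f" processor="FileListEntityProcessor"
        baseDir="/var/data/solr" fileName=".*\.(pdf|doc)"
        recursive="true" rootEntity="false" dataSource="null"
        newerThan="${dataimporter.request.myLastModifiedParam}">
  <!-- Nested TikaEntityProcessor entity and fields as in your existing config. -->
</entity>
```

Note that the wiki says a date math value for newerThan needs single quotes, so the parameter may have to be sent as 'NOW-3DAYS' (URL-encoded) rather than bare; a plain timestamp in yyyy-MM-dd HH:mm:ss format is the other accepted form.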




Re: delta-import of rich documents like word and pdf files!

2011-11-14 Thread Erick Erickson
And you cannot update-in-place. That is, you can't update
just selected fields in a document, you have to re-index the
whole document.

Best
Erick






Re: delta-import of rich documents like word and pdf files!

2011-11-13 Thread neuron005
Thanks for your reply, Mr. Erick.
I have indexed some of my PDF and DOC files.
Now, when I change any of them, I want a delta-import (incremental) so that
I do not have to re-index every document with a full import. Only the
documents that changed should be updated. I am using DataImportHandler. I
have looked in the forums, but all the delta-import questions there relate to
databases. For now I am just indexing some of my DOC and PDF files.
What should I do to achieve that?
Thanks in advance



Re: delta-import of rich documents like word and pdf files!

2011-11-13 Thread neuron005
The changes are to the file content; I may also change a file's author and headers.



Re: delta-import of rich documents like word and pdf files!

2011-11-12 Thread Erick Erickson
Can you give more details about what you're trying to do? It
looks like you're using DataImportHandler? What defines a
document needing to be re-indexed? How do you expect to
be able to identify them???

Perhaps you can review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Nov 12, 2011 at 4:20 AM, neuron005 neuron...@gmail.com wrote:
> I want to perform a delta import of my rich documents, such as PDF and Word
> files.
> I added pk=something to my data-config.xml file,
> but now I don't know the next step. How will delta-import know which
> fields were updated? I am not connected to a database. Is there any
> query like the database queries deltaImportQuery and deltaQuery?
> Does anyone have a solution?
> Thanks in advance
