Re: Problem to start solr-4.0.0-BETA with tomcat-6.0.20
I am having the same problem after upgrading from 3.2 to 4.0. I have sharedLib="lib" added in the <solr> tag and I still get the same error. I deleted all the files from the Solr home directory and copied the files from the 4.0 package, but I still see this error. Where else could the old lib files be referenced? -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-to-start-solr-4-0-0-BETA-with-tomcat-6-0-20-tp4002646p4009466.html Sent from the Solr - User mailing list archive at Nabble.com.
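For reference, this is roughly how I understand the sharedLib attribute is declared in a Solr 4 solr.xml (core name and layout here are just placeholders, not from my actual setup):

```xml
<!-- solr.xml: sharedLib is resolved relative to the Solr home directory,
     so sharedLib="lib" points at $SOLR_HOME/lib -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>
```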
avoid overwrite in DataImportHandler
I have a unique ID defined for the documents I am indexing, and I want to avoid overwriting documents that have already been indexed. I am using XPathEntityProcessor and TikaEntityProcessor to process the documents. The DataImportHandler does not seem to have an option to set overwrite=false. I have read in some other threads to use deduplication instead, but I don't see how it relates to my problem. Any help on this (or an explanation of how deduplication would apply to my problem) would be great. Thanks!
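As I understand the deduplication suggestion, it is configured as an update processor chain in solrconfig.xml rather than in the DIH config. A sketch along the lines of the wiki example, with my own field names (signatureField and fields here are assumptions, untested):

```xml
<!-- solrconfig.xml: compute a signature per document; with
     overwriteDupes=false, existing duplicates are left in place
     rather than deleted and replaced -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <!-- fields the signature is computed from (assumed names) -->
    <str name="fields">url</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then have to be attached to the /dataimport request handler with `<str name="update.chain">dedupe</str>` in its defaults, if I'm reading the docs right.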
Error with Extracting PDF metadata
I am using Solr 3.3 and I am trying to extract and index metadata from PDF files. I am using the DataImportHandler with the TikaEntityProcessor to add the documents. Here are the fields as defined in my schema.xml file:

  <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
  <field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
  <field name="imgName" type="string" indexed="false" stored="true" multiValued="false" required="false"/>
  <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>

So I suppose the metadata should be indexed and stored in fields prefixed with attr_. Here is how my data-config file looks. It takes a source directory path from a database and passes it to a FileListEntityProcessor, which passes each of the PDF files found in the directory to the TikaEntityProcessor to extract and index the content:

  <entity onError="skip" name="fileSourcePaths" rootEntity="false" dataSource="dbSource" fileName=".*pdf" query="select path from file_sources">
    <entity name="fileSource" processor="FileListEntityProcessor" transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
      <field name="link" column="fileAbsolutePath" thumbnail="true"/>
      <field name="imgName" column="imgName"/>
      <entity rootEntity="true" onError="abort" name="file" processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
        <field column="resourceName" name="title" meta="true"/>
        <field column="Creation-Date" name="date_published" meta="true"/>
        <field column="text" name="description"/>
      </entity>
    </entity>
  </entity>

It extracts the description and Creation-Date just fine, but it doesn't seem to extract resourceName, so there is no title field for the documents when I query the index. This is odd because both Creation-Date and resourceName are metadata.
Also, none of the other possible metadata was being stored under the attr_ fields. I came across some threads which said there are known problems with using Tika 0.8, so I downloaded Tika 0.9 and used it in place of 0.8. I also upgraded pdfbox, jempbox and fontbox from 1.3 to 1.4. I tested one of the PDFs separately with just Tika to see what metadata is stored with the file. This is what I found:

  Content-Length: 546459
  Content-Type: application/pdf
  Creation-Date: 2010-06-09T12:11:12Z
  Last-Modified: 2010-06-09T14:53:38Z
  created: Wed Jun 09 08:11:12 EDT 2010
  creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
  producer: Antenna House PDF Output Library 2.6.0 (Windows)
  resourceName: Argentina.pdf
  trapped: False
  xmpTPg:NPages: 2

As you can see, it does have a resourceName metadata entry. I tried indexing again but I got the same result: Creation-Date extracts and indexes just fine, but not resourceName, and the rest of the attributes are not being indexed under the attr_ fields. What's going wrong?
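In case it helps: one workaround I am considering is to stop relying on Tika's resourceName altogether, since the parent FileListEntityProcessor already exposes the file name as an implicit field. A sketch of that idea (untested, and simplified from my real config):

```xml
<!-- FileListEntityProcessor implicitly provides fileName,
     fileAbsolutePath, fileSize and fileLastModified per file -->
<entity name="fileSource" processor="FileListEntityProcessor"
        baseDir="${fileSourcePaths.path}" fileName=".*pdf"
        recursive="true" rootEntity="false">
  <!-- take the title from the file system instead of Tika metadata -->
  <field column="fileName" name="title"/>
  <entity rootEntity="true" name="file" processor="TikaEntityProcessor"
          url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
    <field column="Creation-Date" name="date_published" meta="true"/>
    <field column="text" name="description"/>
  </entity>
</entity>
```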
Using Scriptransformer to send a HTTP Request
I am using Solr to index RSS feeds: the DataImportHandler parses the URLs and then indexes them. I have now implemented a web service that takes a URL, creates a thumbnail image and stores it in a local directory. So here is what I want to do: after a URL is parsed, I want to send an HTTP request to the web service with that URL. ScriptTransformer seemed the way to go, and here is how my data-config.xml file looks:

  <dataConfig>
    <dataSource type="JdbcDataSource" name="dbSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/solr_sources" user="root" password="**"/>
    <document>
      <entity name="rssFeedItems" rootEntity="false" dataSource="dbSource" query="select url from rss_feeds">
        <entity name="rssFeeds" dataSource="urlSource" url="${rssFeedItems.url}" transformer="script:sendURLRequest" processor="XPathEntityProcessor" forEach="/rss/channel/item">
          <field column="title" xpath="/rss/channel/item/title"/>
          <field column="link" xpath="/rss/channel/item/link"/>
          <field column="description" xpath="/rss/channel/item/description"/>
          <field column="date_published" xpath="/rss/channel/item/pubDate"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

As you can see from the data-config file, I am currently testing whether this would work by hard-coding a dummy URL in the script. url.openConnection().connect(); should make the HTTP request, but the image is not generated, and I see no compile errors. I tried the example script of printing out a message:

  var v = new java.lang.Runnable() {
    run: function() { print('PRINTING'); }
  }
  v.run();

and it worked. I even played around with the function names to force it to throw some compile errors, and it did throw errors, which shows that it is able to create objects of class URL and URLConnection. Any suggestions?
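One thing I suspect: connect() only opens the connection, and the request may never be fully driven to completion unless the response is actually read. A sketch of what I mean the script would look like inside the data-config (the function body and the thumbnail-service URL are my own placeholders, not tested):

```xml
<script><![CDATA[
  function sendURLRequest(row) {
    // Hypothetical thumbnail service endpoint; substitute your own.
    var target = 'http://localhost:8080/thumbnail?url=' + row.get('link');
    var conn = new java.net.URL(target).openConnection();
    // connect() alone just establishes the connection; asking for the
    // response code / input stream forces the request to be issued.
    var status = conn.getResponseCode();
    conn.getInputStream().close();
    return row;
  }
]]></script>
```

The <script> element would sit directly under <dataConfig>, with the entity keeping transformer="script:sendURLRequest" as before.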
Indexing PDF documents with no UniqueKey
I want to index PDF (and other rich) documents using the DataImportHandler. Here is how my schema.xml looks:

  ...
  <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
  <field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
  <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
  <uniqueKey>link</uniqueKey>

As you can see, I have set link as the unique key so that documents are not duplicated when the indexing runs again. The file paths are stored in a database, and I have set up the DataImportHandler to get the list of file paths and index each document. To test it I used the tutorial.pdf file that comes with the example docs in Solr. The problem, of course, is that this PDF document doesn't have a 'link' field, so I am looking for a way to manually set the file path as the link when indexing these documents. I tried the data-config settings below,

  <entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
    <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
      <field column="title" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <entity name="filePath" dataSource="dbSource" query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
        <field column="link" name="link"/>
      </entity>
    </entity>
  </entity>

where I create a sub-entity that queries for the path name and returns the result in a column titled 'link'.
But I still see this error:

  WARNING: Error creating document : SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z}, title=title(1.0)={Solr tutorial}}]
  org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: link

Is there any way for me to create a field called link for the PDF documents? This was asked here before: http://lucene.472066.n3.nabble.com/Trouble-with-exception-Document-Null-missing-required-field-DocID-td1641048.html, but the solution provided there uses the ExtractingRequestHandler, and I want to do it through the DataImportHandler.
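One approach that might work instead of the SQL sub-entity: TemplateTransformer can synthesize a field from a variable of the parent entity, so the link could be filled straight from ${fileItems.path}. A sketch of how I think that would look (untested):

```xml
<entity name="fileItems" rootEntity="false" dataSource="dbSource"
        query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor"
          url="${fileItems.path}" dataSource="fileSource"
          transformer="TemplateTransformer">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <!-- TemplateTransformer fills the uniqueKey from the parent row,
         so no extra database round trip is needed -->
    <field column="link" template="${fileItems.path}"/>
  </entity>
</entity>
```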
Re: Updating the data-config file
Thanks. I will look into this and see how it goes.
Updating the data-config file
So I have some RSS feeds that I want to index using Solr. I am using the DataImportHandler and I have added the instructions on how to parse the feeds in the data-config file. Now, if a user wants to add more RSS feeds to index, do I have to programmatically instruct Solr to update the config file? Is there an HTTP POST or GET I can send to update the data-config file?
Re: Updating the data-config file
So you mean I cannot update the data-config programmatically? I don't understand how the request parameters would be of use to me. This is how my data-config file looks:

  <dataConfig>
    <dataSource type="HttpDataSource"/>
    <document>
      <entity name="slashdot" pk="link" url="http://rss.slashdot.org/Slashdot/slashdot" processor="XPathEntityProcessor" forEach="/RDF/channel | /RDF/item" transformer="DateFormatTransformer">
        <field column="title" xpath="/RDF/item/title"/>
        <field column="link" xpath="/RDF/item/link"/>
        <field column="description" xpath="/RDF/item/description"/>
        <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      </entity>
    </document>
  </dataConfig>

I am running a Flash-based application as the front-end UI to show the search results, and I want the user to be able to add new RSS feed data sources.
Re: Updating the data-config file
Ahh, that's interesting! I understand what you mean. Since RSS and Atom feeds have the same structure, parsing them would be the same, but I can run the same import for each of the different URLs. These URLs can be obtained from a database, a file, or through the request parameters, right?
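For the archives, here is a sketch of the request-parameter variant as I understand it: the feed URL is read from a ${dataimporter.request.*} variable, so one entity definition serves any feed passed on the full-import call (the parameter name feedUrl is my own choice, untested):

```xml
<!-- data-config.xml: the URL comes from the request, not the config -->
<entity name="feed" pk="link"
        url="${dataimporter.request.feedUrl}"
        processor="XPathEntityProcessor"
        forEach="/rss/channel/item">
  <field column="title" xpath="/rss/channel/item/title"/>
  <field column="link"  xpath="/rss/channel/item/link"/>
</entity>
```

It would then be invoked with something like /dataimport?command=full-import&feedUrl=<encoded feed URL>, once per feed.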
DIH Scheduling
There is information about scheduling on the wiki (http://wiki.apache.org/solr/DataImportHandler#Scheduling), but I don't understand how to use it. I am not a Java developer, so maybe I am missing something obvious. The instructions at http://stackoverflow.com/questions/3206171/how-can-i-schedule-data-imports-in-solr/6379306#6379306 say to create the classes ApplicationListener, HTTPPostScheduler and SolrDataImportProperties. Where do I create them, and how do I add them to Solr?
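If I'm reading those instructions right, the classes get compiled into a jar placed on the Solr webapp's classpath (e.g. its WEB-INF/lib), and the listener is then registered in the webapp's web.xml, roughly like this (the package name below is a guess and must match however the classes are actually packaged):

```xml
<!-- web.xml of the solr webapp: register the scheduler's
     ServletContextListener so it starts with the webapp -->
<listener>
  <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
</listener>
```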
Re: DIH Scheduling
Thanks. Using curl would be an option, but ideally I want to implement it using this scheduler. I want to ship Solr as part of another application package and send it to clients, so rather than asking them to run a cron job, it would be easier to have Solr configured to run the scheduler itself.