Re: Problem to start solr-4.0.0-BETA with tomcat-6.0.20

2012-09-21 Thread sabman
I am having the same problem after upgrading from 3.2 to 4.0. I have
sharedLib=lib added in the tag and I still get the same error. I deleted
all the files from the Solr home directory and copied in the files from the
4.0 package, but I still see this error. Where else could the old lib files
be referenced?
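For reference, a minimal sketch of where the sharedLib attribute is typically declared in Solr 4.x — on the solr element of solr.xml at the Solr home root. The core name and paths here are illustrative assumptions, not taken from this setup:

```xml
<!-- $SOLR_HOME/solr.xml (illustrative sketch) -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>
```

With this, jars placed in $SOLR_HOME/lib should be picked up at startup; stale jars left over from 3.2 could also live in Tomcat's own lib directories, which are outside the Solr home.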



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-to-start-solr-4-0-0-BETA-with-tomcat-6-0-20-tp4002646p4009466.html
Sent from the Solr - User mailing list archive at Nabble.com.


avoid overwrite in DataImportHandler

2011-12-07 Thread sabman
I have a unique ID defined for the documents I am indexing. I want to avoid
overwriting the documents that have already been indexed. I am using
XPathEntityProcessor and TikaEntityProcessor to process the documents.

The DataImportHandler does not seem to have an option to set
overwrite=false. Some other forum threads suggest using deduplication
instead, but I don't see how that relates to my problem.

Any help on this (or an explanation of how deduplication would apply to my
problem) would be great. Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
Sent from the Solr - User mailing list archive at Nabble.com.


Error with Extracting PDF metadata

2011-07-29 Thread sabman
I am using Solr 3.3 and I am trying to extract and index meta data from PDF
files. I am using the DataImportHandler with the TikaEntityProcessor to add
the documents. Here are the fields as defined in my schema.xml file:


<field name="title" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true"
       multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true"
       multiValued="false" required="false"/>
<field name="imgName" type="string" indexed="false" stored="true"
       multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true"
       multiValued="false"/>

So I suppose the meta data information should be indexed and stored in
fields prefixed as attr_.

Here is how my data-config file looks. It takes a source directory path from
a database and passes it to a FileListEntityProcessor, which in turn passes
each of the PDF files found in the directory to the TikaEntityProcessor to
extract and index the content.

<entity onError="skip" name="fileSourcePaths" rootEntity="false"
        dataSource="dbSource" fileName=".*pdf"
        query="select path from file_sources">
  <entity name="fileSource" processor="FileListEntityProcessor"
          transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}"
          recursive="true" rootEntity="false">
    <field name="link" column="fileAbsolutePath" thumbnail="true"/>
    <field name="imgName" column="imgName"/>
    <entity rootEntity="true" onError="abort" name="file"
            processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}"
            dataSource="fileSource" format="text">
      <field column="resourceName" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <field column="text" name="description"/>
    </entity>
  </entity>
</entity>

It extracts the description and Creation-Date just fine, but it doesn't seem
to be extracting resourceName, so there is no title field for the documents
when I query the index. This is strange because both Creation-Date and
resourceName are metadata. Also, none of the other possible metadata is
being stored under the attr_ fields. I came across some threads which said
there are known problems with Tika 0.8, so I downloaded Tika 0.9 and
replaced 0.8 with it. I also upgraded pdfbox, jempbox and fontbox from 1.3
to 1.4.

I tested one of the pdf's separately with just Tika to see what meta data is
stored with the file. This is what I found:

Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2


As you can see, it does have resourceName metadata. I tried indexing again
but got the same result: Creation-Date extracts and indexes just fine, but
resourceName does not. The rest of the attributes are also not being indexed
under the attr_ fields.

What's going wrong?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-with-Extracting-PDF-metadata-tp3210813p3210813.html
Sent from the Solr - User mailing list archive at Nabble.com.


Using ScriptTransformer to send an HTTP Request

2011-07-21 Thread sabman
I am using Solr to index RSS feeds, with the DataImportHandler to parse the
URLs and index them. I have implemented a web service that takes a URL,
creates a thumbnail image, and stores it in a local directory.

So here is what I want to do: after the URL is parsed, I want to send an
HTTP request to the web service with that URL. ScriptTransformer seemed the
way to go, and here is how my data-config.xml file looks:

<dataConfig>
  <dataSource type="JdbcDataSource" name="dbSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/solr_sources" user="root"
              password="**"/>

  <document>
    <entity name="rssFeedItems" rootEntity="false" dataSource="dbSource"
            query="select url from rss_feeds">
      <entity name="rssFeeds" dataSource="urlSource"
              url="${rssFeedItems.url}" transformer="script:sendURLRequest"
              processor="XPathEntityProcessor" forEach="/rss/channel/item">
        <field column="title" xpath="/rss/channel/item/title"/>
        <field column="link" xpath="/rss/channel/item/link"/>
        <field column="description" xpath="/rss/channel/item/description"/>
        <field column="date_published" xpath="/rss/channel/item/pubDate"/>
      </entity>
    </entity>
.



As you can see from the data-config file, I am currently testing whether
this will work by hard-coding a dummy URL.

url.openConnection().connect(); should make the HTTP request, but the image
is not generated.

I see no compile errors. I tried the example script that prints out a
message:

var v = new java.lang.Runnable() {
  run: function() {
    print('PRINTING');
  }
};
v.run();

And it worked. 

I even played around with the function names to force it to throw some
compile errors, and it did throw them, which shows that it is able to create
objects of the URL and URLConnection classes.

Any suggestions?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Scriptransformer-to-send-a-HTTP-Request-tp3189479p3189479.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing PDF documents with no UniqueKey

2011-07-15 Thread sabman
I want to index PDF (and other rich) documents. I am using the
DataImportHandler.

Here is how my schema.xml looks:

.
.
<field name="title" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true"
       multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true"
       multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true"
       multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true"
       multiValued="false"/>

<uniqueKey>link</uniqueKey>


As you can see, I have set link as the unique key so that documents are not
duplicated when indexing happens again. I have the file paths stored in a
database, and I have set up the DataImportHandler to get a list of all the
file paths and index each document. To test it I used the tutorial.pdf file
that comes with the example docs in Solr. The problem, of course, is that
this PDF document won't have a 'link' field. I am trying to work out how I
can manually set the file path as the link when indexing these documents. I
tried the data-config settings below,

<entity name="fileItems" rootEntity="false" dataSource="dbSource"
        query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor"
          url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <entity name="filePath" dataSource="dbSource"
            query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
      <field column="link" name="link"/>
    </entity>
  </entity>
</entity>


where I create a sub-entity that queries for the path name and returns the
result in a column titled 'link'. But I still see this error:

WARNING: Error creating document :
SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z},
title=title(1.0)={Solr tutorial}}]
org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: link

Is there any way for me to create a field called link for the PDF documents?



This was already asked here before:
http://lucene.472066.n3.nabble.com/Trouble-with-exception-Document-Null-missing-required-field-DocID-td1641048.html
The solution provided there uses the ExtractingRequestHandler, but I want
to do this through the DataImportHandler.
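One possible sketch — an assumption on my part, not something confirmed in this thread: DIH's TemplateTransformer can copy a resolved variable such as the parent entity's path into a field, which might avoid the SQL sub-entity entirely:

```xml
<!-- illustrative sketch: populate "link" from the parent entity's path
     via TemplateTransformer instead of a sub-query -->
<entity name="tika-test" processor="TikaEntityProcessor"
        url="${fileItems.path}" dataSource="fileSource"
        transformer="TemplateTransformer">
  <field column="title" name="title" meta="true"/>
  <field column="Creation-Date" name="date_published" meta="true"/>
  <field column="link" template="${fileItems.path}"/>
</entity>
```

If this works, every Tika document would get its source path as the uniqueKey value without a second database round-trip per file.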

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-PDF-documents-with-no-UniqueKey-tp3173272p3173272.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Updating the data-config file

2011-06-24 Thread sabman
Thanks. I will look into this and see how it goes.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-the-data-config-file-tp3101241p3104470.html
Sent from the Solr - User mailing list archive at Nabble.com.


Updating the data-config file

2011-06-23 Thread sabman
So I have some RSS feeds that I want to index using Solr. I am using the
DataImportHandler and I have added the instructions on how to parse the
feeds in the data-config file. 

Now, if a user wants to add more RSS feeds to index, do I have to
programmatically instruct Solr to update the config file? Is there an HTTP
POST or GET I can send to update the data-config file?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-the-data-config-file-tp3101241p3101241.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Updating the data-config file

2011-06-23 Thread sabman
So you mean I cannot update the data-config programmatically? I don't
understand how the request parameters would be of use to me.

This is how my data-config file looks:


<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">
      <field column="title" xpath="/RDF/item/title"/>
      <field column="link" xpath="/RDF/item/link"/>
      <field column="description" xpath="/RDF/item/description"/>
      <field column="date" xpath="/RDF/item/date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
    </entity>
  </document>
</dataConfig>

I am running a Flash-based application as the front-end UI to show the
search results. Now I want users to be able to add new RSS feed data
sources.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-the-data-config-file-tp3101241p3101530.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Updating the data-config file

2011-06-23 Thread sabman
Ah! That's interesting!

I understand what you mean. Since RSS and Atom feeds have the same
structure, parsing them would be the same, and I can do that for each of the
different URLs. These URLs can be obtained from a db, a file or through the
request parameters, right?
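A sketch of what the request-parameter option could look like — the parameter name feedUrl here is an illustrative assumption: DIH entities can reference parameters passed on the import request via ${dataimporter.request.paramName}, so the feed URL can be supplied per full-import call instead of being hard-coded in the config:

```xml
<!-- illustrative sketch: feed URL taken from a request parameter,
     e.g. /dataimport?command=full-import&feedUrl=http://example.com/rss -->
<entity name="feed"
        url="${dataimporter.request.feedUrl}"
        processor="XPathEntityProcessor"
        forEach="/rss/channel/item">
  <field column="title" xpath="/rss/channel/item/title"/>
  <field column="link" xpath="/rss/channel/item/link"/>
</entity>
```

The front end would then only need to issue one HTTP request per new feed, with no edits to data-config.xml at all.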

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updating-the-data-config-file-tp3101241p3102225.html
Sent from the Solr - User mailing list archive at Nabble.com.


DIH Scheduling

2011-06-21 Thread sabman
There is information about scheduling here:
http://wiki.apache.org/solr/DataImportHandler#Scheduling
but I don't understand how to use it. I am not a Java developer, so maybe I
am missing something obvious.

The instructions here:
http://stackoverflow.com/questions/3206171/how-can-i-schedule-data-imports-in-solr/6379306#6379306
say to create the classes ApplicationListener, HTTPPostScheduler and
SolrDataImportProperties. Where do I create them, and how do I add them to
Solr?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Scheduling-tp3091764p3091764.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH Scheduling

2011-06-21 Thread sabman
Thanks. Using curl would be an option, but ideally I want to implement this
using the scheduler. I want to ship Solr as part of another application
package to clients, so rather than asking them to run a cron job, it would
be easier to have Solr configured to run the scheduler itself.
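For reference, the curl-based alternative mentioned above is just a scheduled HTTP call to the DIH endpoint. A crontab sketch — the host, port, core path and command are assumptions based on a default single-core setup:

```
# illustrative crontab entry: trigger a DIH delta-import at the top of every hour
0 * * * * curl -s "http://localhost:8983/solr/dataimport?command=delta-import" > /dev/null
```

The trade-off is exactly the one described above: cron keeps Solr unmodified but pushes setup work onto each client machine, while the in-Solr scheduler keeps everything inside the shipped package.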

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Scheduling-tp3091764p3092985.html
Sent from the Solr - User mailing list archive at Nabble.com.