On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:

> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> > I am a little confused how you have things set up. So these meta-data
> > files contain certain information, and there may or may not be a PDF,
> > XLS, or DOC that is associated with them?
>
> Yes, you have it right.
>
> > If that is the case, if it were me I would write something to parse
> > the meta-data files, and if there is a binary file associated with one,
> > submit it using the URL I showed you. If the meta-data is just that and
> > has no associated documents, submit it in XML form. The script
> > shouldn't be too complicated, but that would depend on the complexity
> > of the meta-data you are parsing.
> >
> > To give you an idea how I use it: we have hundreds of documents in
> > PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats. When a document is to
> > be indexed by Solr we look at the extension; if it is txt, html, or htm
> > we read the data in and submit it with the XML handler. If the document
> > is one of the binary formats, we submit it with the URL I showed you.
> > All information about these files is stored in a database, and some of
> > the 'documents' in the database are just links to external documents.
> > In that case we are only indexing a description, title, and category.
> >
> > You are correct, it would overwrite the data by doing an update unless
> > you parsed the meta-data; and if you are parsing the meta-data, you
> > might as well just parse it from the start and index once.
> >
> > How are you handling these meta-data files right now? Are they simply
> > XML files, like in the Solr example, that you are just running the bash
> > script on, or is something parsing the contents already?
>
> Yes, I am running a similar bash script to index these meta-data XML
> docs. The big downside in using the URL way is that, for one thing, it
> has the character limit (1024, is it?).
> So, if I had a lot of meta-data, or even a long description for a
> record, that might not work all that well. I am guessing you haven't run
> into this issue yet, right?
>
> > - Pete
>
> The proposed schema additions might not make sense for everyone, since
> the actual requirements might be more complex than just that (i.e., say
> you want to extract text, structure it in various elements, update your
> doc XML, and then index). But it goes well with Solr's
> search-engine-in-a-box perception, now with a full-text- prefix to it.
> Another way I can see it happening is to extend the default handler and
> still take in an XML doc, but look out for, say, a field named '<file>'.
> From there on, within the handler, you can validate the filename and
> handle it any way you want (create extra elements, create '<pdf>' for
> PDF files and '<html>' for HTML files, etc.). This strips out having to
> deal with if/else scripting outside of Solr.
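[The "if/else scripting outside of Solr" quoted above, i.e. Pete's extension-based dispatch, might look roughly like the sketch below. The Solr base URL and the 'data' storage field are assumptions for illustration; the /update/rich parameters follow the SOLR-284 patch discussed later in this thread.]

```python
import os
import urllib.parse

SOLR_BASE = "http://localhost:8983/solr"        # assumed Solr location
TEXT_TYPES = {".txt", ".htm", ".html"}          # read in, submit via the XML handler
RICH_TYPES = {".pdf", ".doc", ".xls", ".ppt"}   # submit via the rich-document URL

def submit_url(path, doc_id, extra_fields):
    """Pick the handler for a document: the plain XML handler for
    text/html, the rich-document URL for the binary formats."""
    ext = os.path.splitext(path)[1].lower()
    if ext in TEXT_TYPES:
        # caller reads the file and POSTs its text inside an <add> message
        return SOLR_BASE + "/update"
    if ext not in RICH_TYPES:
        raise ValueError("unsupported document type: " + ext)
    params = {
        "stream.file": os.path.abspath(path),
        "stream.type": ext.lstrip("."),
        "id": doc_id,
        "stream.fieldname": "data",             # hypothetical storage field
        "fieldnames": ",".join(extra_fields),
    }
    params.update(extra_fields)                 # urlencode handles the escaping
    return SOLR_BASE + "/update/rich?" + urllib.parse.urlencode(params)
```

[It also makes the 1024-character worry concrete: every metadata value rides in the query string, so a long description would have to move to the XML route instead.]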
Of course, I meant to create a new handler, not just modify the standard
one (which is a big no-no, I know that!). Perhaps name it '/update/file'.

Rao

> > On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> >
> > > Pete,
> > >
> > > Thanks for the great explanation.
> > >
> > > Thinking it through my process, I am not sure how to use it:
> > >
> > > I have a bunch of docs that pretty much contain a lot of meta-data,
> > > some of which include full-text files (.pdf, .ppt, etc.). I use
> > > these docs currently to index/update into Solr. The next step now is
> > > to somehow index the text from the full-text files. One way to think
> > > about it is, I could have a placeholder field 'data' and keep it
> > > empty for the first pass, and then run update/rich to index the
> > > actual full text, using the same unique doc id. But this would
> > > actually overwrite the doc in the index, won't it? And there really
> > > isn't a 'merge' operation, right?
> > >
> > > There might be a better way to use this full-text indexing option,
> > > schema-wise, say:
> > >   <richData source="FIELDNAME" dest="FIELDNAME" />
> > > - have a new option richData that will take in a source field name,
> > > - validate its value (valid filename/file),
> > > - recognize the file type,
> > > - and put the 'data' into another field
> > >
> > > What do you think? I am not a true Java developer, so I'm not sure
> > > if I could do it myself, but only hope that someone else on the
> > > project could ;-)...
> > >
> > > Rao
> > >
> > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > > Installing the patch requires downloading the latest Solr via
> > > > Subversion and applying the patch to the source. Eric has updated
> > > > his patch with various revisions of Subversion. To make sure it
> > > > will compile, I suggest getting the revision he lists.
> > > >
> > > > As for using the features of this patch.
> > > > This is the URL that would be called:
> > > >
> > > > /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
> > > >
> > > > Breaking this down:
> > > >
> > > > You have stream.file, which will be the absolute path to the file
> > > > you want to index. You then have stream.type, which specifies the
> > > > type of file; it currently supports pdf, xls, doc, and ppt. The
> > > > next field is the id, which is where you specify the unique value
> > > > for the id in your schema. For example, we had a document
> > > > referenced in a database, and that id was 103, so we would specify
> > > > the value 103 to identify which document it was in the index.
> > > > stream.fieldname is the name of the field in your index that will
> > > > actually be storing the text from the document. We had the field
> > > > 'data', so it would be stream.fieldname=data in the URL.
> > > >
> > > > The parameter fieldnames is any additional fields in your index
> > > > that need to be filled. We were passing a category, a description
> > > > for the document, a name, and the type. So you just need to
> > > > specify the names of the fields. Solr will then look for
> > > > corresponding parameters with those names, which you can see at
> > > > the end of my URL. The values passed for the additional parameters
> > > > need to be sent URL-encoded.
> > > >
> > > > I'm not a Java programmer, so if you have questions about the
> > > > internals of the code, definitely direct those to Eric as I cannot
> > > > help; I have only implemented it in web applications. If you have
> > > > any other questions about the use of the patch, I can answer them.
> > > >
> > > > Enjoy!
> > > >
> > > > - Pete
> > > >
> > > > On 8/21/07, Vish D.
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > There seems to be some code out for Tika now (not
> > > > > packaged/announced yet, but...). Could someone please take a
> > > > > look at it and see if that could fit in? I am eagerly waiting
> > > > > for a reply back from tika-dev, but no luck yet.
> > > > >
> > > > > http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> > > > >
> > > > > I see that Eric's patch uses POI (for most of it), so that's
> > > > > great! I have seen too many duplicated efforts, even in Apache
> > > > > projects alone, and this is one step closer to fixing it (other
> > > > > than Tika, which isn't 'complete' yet). Are there any plans on
> > > > > releasing this patch with the Solr dist? Or any instructions on
> > > > > using/installing the patch itself?
> > > > >
> > > > > Thanks
> > > > > Vish
> > > > >
> > > > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Christian,
> > > > > >
> > > > > > Eric Pugh implemented this functionality for a project we were
> > > > > > doing and has released the code on JIRA. We have had very good
> > > > > > results with it. If I can be of any help using it beyond the
> > > > > > Java code itself, let me know. The last revision I used with
> > > > > > it was 552853, so if the build happens to fail you can roll
> > > > > > back to that and it will work.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/SOLR-284
> > > > > >
> > > > > > - Pete
> > > > > >
> > > > > > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Hi Solr Users,
> > > > > > >
> > > > > > > I have set up a Solr server with a custom schema. Now I have
> > > > > > > updated the index with some content from XML files.
> > > > > > >
> > > > > > > Now I try to update the contents of a folder.
> > > > > > > The folder consists of various document types
> > > > > > > (pdf, doc, xls, ...).
> > > > > > >
> > > > > > > Is there a howto anywhere on how I can parse the documents,
> > > > > > > make an XML of the parsed content, and post it to the Solr
> > > > > > > server?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > > > Christian
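[For the plain-XML route discussed in this thread, the message POSTed to /solr/update is Solr's standard <add> format. A rough sketch of building one from already-parsed content follows; the field names are illustrative, not from the thread.]

```python
from xml.sax.saxutils import escape, quoteattr

def make_add_xml(doc_id, fields):
    """Build a Solr <add> message; values are escaped so extracted
    document text cannot break the XML."""
    parts = ["<add>", "  <doc>",
             '    <field name="id">%s</field>' % escape(str(doc_id))]
    for name, value in fields.items():
        parts.append("    <field name=%s>%s</field>"
                     % (quoteattr(name), escape(value)))
    parts.extend(["  </doc>", "</add>"])
    return "\n".join(parts)

# POST the result to /solr/update with Content-Type text/xml,
# then POST <commit/> to make the new documents searchable.
```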