On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:

> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> > I am a little confused how you have things set up. So these meta-data
> > files contain certain information, and there may or may not be a PDF,
> > XLS, or DOC that is associated with them?
>
> Yes, you have it right.
>
> > If that is the case, if it were me I would write something to parse
> > the meta-data files, and if there is a binary file associated with one,
> > submit it using the URL I showed you. If the meta-data is just that and
> > has no associated documents, submit it in XML form. The script
> > shouldn't be too complicated, but that would depend on the complexity
> > of the meta-data you are parsing.
> >
> > To give you an idea how I use it: we have hundreds of documents in
> > PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats. When a document is to
> > be indexed by Solr we look at the extension; if it is txt, html, or htm
> > we read the data in and submit it with the XML handler. If the document
> > is one of the binary formats, we submit it with the URL I showed you.
> > All information about these files is stored in a database, and some of
> > the 'documents' in the database are just links to external documents.
> > In that case we are only indexing a description, title, and category.
> >
> > You are correct, it would overwrite the data by doing an update unless
> > you parsed the meta-data; and if you are parsing the meta-data, you
> > might as well just parse it from the start and index once.
> >
> > How are you handling these meta-data files right now? Are they simply
> > XML files, like in the Solr example, that you are just running the bash
> > script on, or is something parsing the contents already?
>
> Yes, I am running a similar bash script to index these meta-data XML
> docs. The big downside in using the URL way is that, for one thing, it
> has the character limit (1024, is it?).
> So, if I had a lot of meta-data, or even a long description for a
> record, that might not work all that well. I am guessing you haven't run
> into this issue yet, right?
>
> > - Pete
>
> The proposed schema additions might not make sense for everyone, since
> the actual requirements might be more complex than just that (i.e., say
> you want to extract text, structure it in various elements, update your
> doc XML, and then index). But it goes well with Solr's
> search-engine-in-a-box perception, now with a full-text- prefix to it.
> Another way I can see it happening is to extend the default handler and
> still take in an XML doc, but look out for, say, a field named '<file>'.
> From there on, within the handler, you can validate the filename and
> handle it any way you want (create extra elements, create '<pdf>' for
> PDF files and '<html>' for HTML files, etc.). This strips out having to
> deal with if/else scripting outside of Solr.
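[The "if/else scripting outside of Solr" quoted above, i.e. Pete's extension-based dispatch, might look roughly like the sketch below. The Solr base URL and the 'data' storage field are assumptions for illustration; the /update/rich parameters follow the SOLR-284 patch discussed later in this thread.]

```python
import os
import urllib.parse

SOLR_BASE = "http://localhost:8983/solr"        # assumed Solr location
TEXT_TYPES = {".txt", ".htm", ".html"}          # read in, submit via the XML handler
RICH_TYPES = {".pdf", ".doc", ".xls", ".ppt"}   # submit via the rich-document URL

def submit_url(path, doc_id, extra_fields):
    """Pick the handler for a document: the plain XML handler for
    text/html, the rich-document URL for the binary formats."""
    ext = os.path.splitext(path)[1].lower()
    if ext in TEXT_TYPES:
        # caller reads the file and POSTs its text inside an <add> message
        return SOLR_BASE + "/update"
    if ext not in RICH_TYPES:
        raise ValueError("unsupported document type: " + ext)
    params = {
        "stream.file": os.path.abspath(path),
        "stream.type": ext.lstrip("."),
        "id": doc_id,
        "stream.fieldname": "data",             # hypothetical storage field
        "fieldnames": ",".join(extra_fields),
    }
    params.update(extra_fields)                 # urlencode handles the escaping
    return SOLR_BASE + "/update/rich?" + urllib.parse.urlencode(params)
```

[It also makes the 1024-character worry concrete: every metadata value rides in the query string, so a long description would have to move to the XML route instead.]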
Of course, I meant to create a new handler, not just modify the standard
one (which is a big no-no, I know that!). Perhaps name it '/update/file'.

Rao

> > On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> >
> > > Pete,
> > >
> > > Thanks for the great explanation.
> > >
> > > Thinking it through my process, I am not sure how to use it:
> > >
> > > I have a bunch of docs that pretty much contain a lot of meta-data,
> > > some of which include full-text files (.pdf, .ppt, etc.). I use
> > > these docs currently to index/update into Solr. The next step now is
> > > to somehow index the text from the full-text files. One way to think
> > > about it is, I could have a placeholder field 'data' and keep it
> > > empty for the first pass, and then run update/rich to index the
> > > actual full text, using the same unique doc id. But this would
> > > actually overwrite the doc in the index, won't it? And there really
> > > isn't a 'merge' operation, right?
> > >
> > > There might be a better way to use this full-text indexing option,
> > > schema-wise, say:
> > >   <richData source="FIELDNAME" dest="FIELDNAME" />
> > > - have a new option richData that will take in a source field name,
> > > - validate its value (valid filename/file),
> > > - recognize the file type,
> > > - and put the 'data' into another field
> > >
> > > What do you think? I am not a true Java developer, so I'm not sure
> > > if I could do it myself, but only hope that someone else on the
> > > project could ;-)...
> > >
> > > Rao
> > >
> > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > > Installing the patch requires downloading the latest Solr via
> > > > Subversion and applying the patch to the source. Eric has updated
> > > > his patch with various revisions of Subversion. To make sure it
> > > > will compile, I suggest getting the revision he lists.
> > > >
> > > > As for using the features of this patch.
> > > > This is the URL that would be called:
> > > >
> > > > /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
> > > >
> > > > Breaking this down:
> > > >
> > > > You have stream.file, which will be the absolute path to the file
> > > > you want to index. You then have stream.type, which specifies the
> > > > type of file; it currently supports pdf, xls, doc, and ppt. The
> > > > next field is the id, which is where you specify the unique value
> > > > for the id in your schema. For example, we had a document
> > > > referenced in a database, and that id was 103, so we would specify
> > > > the value 103 to identify which document it was in the index.
> > > > stream.fieldname is the name of the field in your index that will
> > > > actually be storing the text from the document. We had the field
> > > > 'data', so it would be stream.fieldname=data in the URL.
> > > >
> > > > The parameter fieldnames is any additional fields in your index
> > > > that need to be filled. We were passing a category, a description
> > > > for the document, a name, and the type. So you just need to
> > > > specify the names of the fields. Solr will then look for
> > > > corresponding parameters with those names, which you can see at
> > > > the end of my URL. The values passed for the additional parameters
> > > > need to be sent URL-encoded.
> > > >
> > > > I'm not a Java programmer, so if you have questions about the
> > > > internals of the code, definitely direct those to Eric as I cannot
> > > > help; I have only implemented it in web applications. If you have
> > > > any other questions about the use of the patch, I can answer them.
> > > >
> > > > Enjoy!
> > > >
> > > > - Pete
> > > >
> > > > On 8/21/07, Vish D.
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > There seems to be some code out for Tika now (not
> > > > > packaged/announced yet, but...). Could someone please take a
> > > > > look at it and see if that could fit in? I am eagerly waiting
> > > > > for a reply back from tika-dev, but no luck yet.
> > > > >
> > > > > http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> > > > >
> > > > > I see that Eric's patch uses POI (for most of it), so that's
> > > > > great! I have seen too many duplicated efforts, even in Apache
> > > > > projects alone, and this is one step closer to fixing it (other
> > > > > than Tika, which isn't 'complete' yet). Are there any plans on
> > > > > releasing this patch with the Solr dist? Or any instructions on
> > > > > using/installing the patch itself?
> > > > >
> > > > > Thanks
> > > > > Vish
> > > > >
> > > > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Christian,
> > > > > >
> > > > > > Eric Pugh implemented this functionality for a project we were
> > > > > > doing and has released the code on JIRA. We have had very good
> > > > > > results with it. If I can be of any help using it beyond the
> > > > > > Java code itself, let me know. The last revision I used with
> > > > > > it was 552853, so if the build happens to fail you can roll
> > > > > > back to that and it will work.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/SOLR-284
> > > > > >
> > > > > > - Pete
> > > > > >
> > > > > > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Hi Solr Users,
> > > > > > >
> > > > > > > I have set up a Solr server with a custom schema. Now I have
> > > > > > > updated the index with some content from XML files.
> > > > > > >
> > > > > > > Now I try to update the contents of a folder.
> > > > > > > The folder consists of various document types
> > > > > > > (pdf, doc, xls, ...).
> > > > > > >
> > > > > > > Is there a howto anywhere on how I can parse the documents,
> > > > > > > make an XML of the parsed content, and post it to the Solr
> > > > > > > server?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > > > Christian
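[For the plain-XML route discussed in this thread, the message POSTed to /solr/update is Solr's standard <add> format. A rough sketch of building one from already-parsed content follows; the field names are illustrative, not from the thread.]

```python
from xml.sax.saxutils import escape, quoteattr

def make_add_xml(doc_id, fields):
    """Build a Solr <add> message; values are escaped so extracted
    document text cannot break the XML."""
    parts = ["<add>", "  <doc>",
             '    <field name="id">%s</field>' % escape(str(doc_id))]
    for name, value in fields.items():
        parts.append("    <field name=%s>%s</field>"
                     % (quoteattr(name), escape(value)))
    parts.extend(["  </doc>", "</add>"])
    return "\n".join(parts)

# POST the result to /solr/update with Content-Type text/xml,
# then POST <commit/> to make the new documents searchable.
```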