No, I am using xpath for HTML; that is not the question. In addition to the HTML I was already indexing, I am now indexing pure text, such as a TXT file or a Microsoft Word doc. There is no xpath for TXT, so how do I index a TXT file into different fields in my index, the way I use xpath to index HTML into different fields?
My question is about pure text such as .txt files and Microsoft Word documents, not HTML. I am completely fine with HTML. Thanks.

________________________________
From: Erick Erickson <erickerick...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Seperate Fields?

Can you provide a few more details? You mention xpath, which leads me to believe that you are using DIH, is that true? How are you getting your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a filesystem, you could use two fileDataSources, one that works only on files with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index "just the text" of a Word document, you have to parse it quite differently than a plain text file; take a look at Tika.

All of which may not help you at all, because I'm guessing... So I think a more complete problem statement would help us help you.

Best
Erick

On Wed, Sep 29, 2010 at 3:56 PM, Savannah Beckett <savannah_becket...@yahoo.com> wrote:

> Hi,
> I am using xpath to index different parts of the html pages into different
> fields. Now, I have some pure text documents that has no html. So I can't use
> xpath. How do I index these pure text into different fields of the index? How
> do I make nutch/solr understand these different parts belong to different
> fields? Maybe I can use existing content in the fields in my index?
> Thanks.
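
To make the plain-text case in this thread concrete, here is a minimal sketch of one way a .txt file could be split into separate fields before it is sent to the index. It is only an illustration under stated assumptions: the "Label: value" header layout, the function name split_text_into_fields, and the field names title, author and content are all hypothetical and do not come from nutch, solr or DIH. A Word document would first need its text extracted (for example with Tika, as suggested above) before a split like this could apply.

```python
import re

# Assumption (hypothetical layout): the text starts with "Label: value"
# header lines, then a blank line, then the free-text body.
HEADER_RE = re.compile(r"^(?P<name>[A-Za-z][A-Za-z0-9_]*):\s*(?P<value>.*)$")


def split_text_into_fields(text, default_field="content"):
    """Split plain text into a dict of field name -> field value.

    Leading "Label: value" lines become individual fields (names are
    lowercased); everything after the first non-header line is stored
    under default_field.
    """
    fields = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = HEADER_RE.match(lines[i])
        if not match:
            break  # first non-header line ends the header block
        fields[match.group("name").lower()] = match.group("value").strip()
        i += 1
    fields[default_field] = "\n".join(lines[i:]).strip()
    return fields


# Hypothetical sample document, used only to show the shape of the result.
sample = (
    "Title: Quarterly report\n"
    "Author: J. Doe\n"
    "\n"
    "Revenue grew in the third quarter."
)
doc = split_text_into_fields(sample)
print(doc)
```

The resulting dict maps field names to values, so it could be turned into whatever document format the indexer accepts. If the text files have no such labeled header, some other rule (fixed line positions, a regular expression per field, and so on) would have to take its place.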