RE: index pdf files

Ma, Xiaohui (NIH/NLM/LHC) [C] Thu, 12 Aug 2010 08:59:35 -0700

Thanks so much. I didn't know how to make any changes in schema.xml for PDF 
files, so I used the default Solr schema.xml. Please tell me what I need to 
change in schema.xml.


The simple Java program I use is below. I also attached the PDF file. I 
really appreciate your help!
*********************************
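import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class importPDF {
  public static void main(String[] args) {
    try {
      String fileName = "pub2009001.pdf";
      String solrId = "pub2009001.pdf";

      indexFilesSolrCell(fileName, solrId);

    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  // Sends the file to the Solr Cell extracting handler (/update/extract).
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    String urlString = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    ContentStreamUpdateRequest up
      = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File(fileName));

    // Use solrId as the value of the document's id field.
    up.setParam("literal.id", solrId);
    // Prefix extracted fields that are not in the schema with "attr_".
    up.setParam("uprefix", "attr_");
    // Store the extracted body text in the attr_content field.
    up.setParam("fmap.content", "attr_content");

    // Commit right away (waitFlush = true, waitSearcher = true).
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);
  }
}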
********************************************
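
One thing that may be worth checking: the program maps the extracted text to
attr_content, while a bare single-word query is matched against the schema's
default search field, which may not be attr_content. The sketch below only
builds a field-qualified query URL so it can be tried in a browser. The class
FieldQueryUrl and its method are hypothetical helpers, the /select handler
path is assumed to be the standard one for this core, and "microarray" is
just a word taken from the sample output.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original program.
class FieldQueryUrl {

  // Builds a /select URL whose q parameter is "<field>:<term>", URL-encoded.
  static String buildSelectUrl(String solrBase, String field, String term) {
    try {
      String q = URLEncoder.encode(field + ":" + term, "UTF-8");
      return solrBase + "/select?q=" + q;
    } catch (UnsupportedEncodingException e) {
      // UTF-8 support is mandatory on every JVM, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Base URL copied from the program above; /select is an assumption.
    String base = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
    System.out.println(buildSelectUrl(base, "attr_content", "microarray"));
  }
}
```

If the field-qualified query returns the document while the bare word does
not, the difference is probably in which field the query searches rather
than in the indexing code.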

-----Original Message-----
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] 
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you, we need the definitions of your fields in schema.xml and the
query you run when you search for a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] <xiao...@mail.nlm.nih.gov>

> I wrote a simple Java program to import a PDF file. I get a result when I
> search for *:* from the admin page, but I get nothing if I search for a
> word. I wonder if I did something wrong or missed a setting.
>
> Here is part of the result I get when I do a *:* search:
> *********************************************
> - <doc>
> - <arr name="attr_Author">
>  <str>Hristovski D</str>
>  </arr>
> - <arr name="attr_Content-Type">
>  <str>application/pdf</str>
>  </arr>
> - <arr name="attr_Keywords">
>  <str>microarray analysis, literature-based discovery, semantic
> predications, natural language processing</str>
>  </arr>
> - <arr name="attr_Last-Modified">
>  <str>Thu Aug 12 10:58:37 EDT 2010</str>
>  </arr>
> - <arr name="attr_content">
>  <str>Combining Semantic Relations and DNA Microarray Data for Novel
> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
> Kastrin,2...............
> *********************************************
> Please help me out if anyone has experience with pdf files. I really
> appreciate it!
>
> Thanks so much,
>
>
