I am new to both Solr and Cell, so apologies if I misuse some of the
terminology. The problem I am trying to solve is to index a PDF document
using Solr Cell while excluding part of it via XPath. I am using Solr
release 3.1. While searching the user list, I came across one entry on this
topic, titled 'XPath query support in Solr Cell', which clarified one issue,
but I am still having trouble getting what I want.

Here is what I have done so far:

First, I ran the following curl command to see what I would get:

curl "http://localhost:8983/solr/docs/update/extract?literal.id=123&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&extractOnly=true" -F "file=@/docs/test.pdf"

This worked fine. Next I tried getting the first DIV element by modifying the 
XPATH query as follows:

curl "http://localhost:8983/solr/docs/update/extract?literal.id=123&xpath=/xhtml:html/xhtml:body/xhtml:div\[1\]/descendant:node()&extractOnly=true" -F "file=@/docs/test.pdf"

Note that I am escaping the '[' and ']'; I also tried their percent-encoded
values, %5B and %5D. The command ran, but it did not match anything. Here is
what I got:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">627</int></lst>
<str name="test.pdf"/>
<lst name="test.pdf_metadata">
<arr name="stream_source_info"><str>file</str></arr>
<arr name="subject"><str>Version A-2007.12</str></arr>
<arr name="Last-Modified"><str>2009-08-12T17:07:27Z</str></arr>
<arr name="Author"><str>Test title.</str></arr>
<arr name="creator"><str>FrameMaker 7.1</str></arr>
<arr name="xmpTPg:NPages"><str>187</str></arr>
<arr name="Creation-Date"><str>2009-08-12T17:07:27Z</str></arr>
<arr name="title"><str>Test Document</str></arr>
<arr name="stream_content_type"><str>application/octet-stream</str></arr>
<arr name="created"><str>Wed Aug 12 10:07:27 PDT 2009</str></arr>
<arr name="stream_size"><str>1372769</str></arr>
<arr name="stream_name"><str>test.pdf</str></arr>
<arr name="producer"><str>Acrobat Distiller 7.0.5 (Windows)</str></arr>
<arr name="Copyright"><str>2007</str></arr>
<arr name="Content-Type"><str>application/pdf</str></arr>
<arr name="Keywords"><str>Test</str></arr>
</lst>
</response>
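
For reference, the percent-encoded variant of the second request URL can be
built as in the sketch below. This is only an illustration in Python: the
helper name build_extract_url is made up for this example, and the host, core
name, and document id are simply the ones from my curl commands. It shows
that only the brackets change (to %5B and %5D) while the rest of the XPath
stays readable.

```python
from urllib.parse import quote, urlencode


def build_extract_url(base, doc_id, xpath):
    """Build a Solr Cell extract-only URL with a percent-encoded XPath.

    Hypothetical helper for illustration; it is not part of Solr or curl.
    """
    params = {"literal.id": doc_id, "xpath": xpath, "extractOnly": "true"}
    # Keep '/', ':', '(' and ')' literal so the XPath stays readable;
    # everything else that is unsafe in a query string, including '[' and ']',
    # is percent-encoded.
    return base + "?" + urlencode(params, safe="/:()", quote_via=quote)


url = build_extract_url(
    "http://localhost:8983/solr/docs/update/extract",
    "123",
    "/xhtml:html/xhtml:body/xhtml:div[1]/descendant:node()",
)
print(url)
```

The printed URL can be passed to curl in place of the hand-escaped one, still
with -F "file=@/docs/test.pdf".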

Separately, I worked out what an XPath expression for my purpose could look
like. This one should get me most of the way there:

//xhtml:body/xhtml:div\[not(contains(p,'EXCLUDE TEXT'))\]

I validated the XPath expression independently at the following URL, as was
suggested in the previously mentioned posting:

http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm

Any suggestions or help would be greatly appreciated.

