About XPaths: the DIH XPath engine supports only a limited subset of
XPath. According to the documentation, your paths fall within that subset.

About logs: your entity lists only RegexTransformer. To get log output
for each row, you need to add LogTransformer to the transformer list:
http://wiki.apache.org/solr/DataImportHandler#LogTransformer
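
A rough sketch of what that change might look like on your entity. The
logTemplate text and the "info" level are only assumptions for
illustration; the url attribute and the field elements are omitted here
and would stay as they are in your config:

```xml
<!-- Sketch only: add LogTransformer after RegexTransformer so each row is
     logged after the regex has run. The template wording is hypothetical. -->
<entity name="coremetrics"
        dataSource="coremetrics"
        pk="id"
        processor="XPathEntityProcessor"
        stream="true"
        forEach="/ReportDataResponse/Data/Rows/Row"
        transformer="RegexTransformer,LogTransformer"
        logTemplate="row: part_code=${coremetrics.part_code} page_views=${coremetrics.page_views}"
        logLevel="info">
    <!-- url attribute and field elements unchanged from the original config -->
</entity>
```

If no such lines show up in the log during an import, that would suggest
the forEach expression is producing no rows at all.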

Using XML entity codes (&amp;) in the url attribute looks right. Can you
verify the URL that actually reaches the remote site? Can you read the logs
at the remote site? Can you run the import through a proxy and watch the
data that comes back?

On Fri, Aug 24, 2012 at 1:34 PM, Carrie Coy <c...@ssww.com> wrote:
> I'm trying to write a DIH to incorporate page view metrics from an XML feed
> into our index.   The DIH makes a single request, and updates 0 documents.
> I set log level to "finest" for the entire dataimport section, but I still
> can't tell what's wrong.  I suspect the XPath.
> http://localhost:8080/solr/core1/admin/dataimport.jsp?handler=/dataimport
> returns 404.  Any suggestions on how I can debug this?
>
> solr-spec: 4.0.0.2012.08.06.22.50.47
>
>
> The XML data:
>
> <?xml version='1.0' encoding='UTF-8'?>
> <ReportDataResponse>
> <Data>
> <Rows>
> <Row rowKey="P#PRODUCT: BURLAP POTATO SACKS  (PACK OF 12)
> (W4537)#N/A#550000000016196614" rowActionAvailability="0 0 0">
> <Value columnId="PAGE_NAME" comparisonSpecifier="A">PRODUCT: BURLAP POTATO
> SACKS  (PACK OF 12) (W4537)</Value>
> <Value columnId="PAGE_VIEWS" comparisonSpecifier="A">2388</Value>
> </Row>
> <Row rowKey="P#PRODUCT: OPAQUE PONY BEADS 6X9MM  (BAG OF 850)
> (BE9000)#N/A#550000000021976460" rowActionAvailability="0 0 0">
> <Value columnId="PAGE_NAME" comparisonSpecifier="A">PRODUCT: OPAQUE PONY
> BEADS 6X9MM  (BAG OF 850) (BE9000)</Value>
> <Value columnId="PAGE_VIEWS" comparisonSpecifier="A">1313</Value>
> </Row>
> </Rows>
> </Data>
> </ReportDataResponse>
>
> My DIH:
>
> <dataConfig>
>  <dataSource name="coremetrics"
>              type="URLDataSource"
>              encoding="UTF-8"
>              connectionTimeout="5000"
>              readTimeout="10000"/>
>
>  <document>
>         <entity  name="coremetrics"
>             dataSource="coremetrics"
>             pk="id"
>
> url="https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=******&amp;username=****&amp;format=XML&amp;userAuthKey=****&amp;language=en_US&mp;viewID=9475540&amp;period_a=M20110930";
>             processor="XPathEntityProcessor"
>             stream="true"
>             forEach="/ReportDataResponse/Data/Rows/Row"
>             logLevel="fine"
>             transformer="RegexTransformer"  >
>
>             <field  column="part_code"  name="id"
> xpath="/ReportDataResponse/Data/Rows/Row/Value[@columnId='PAGE_NAME']"
> regex="/^PRODUCT:.*\((.*?)\)$/"  replaceWith="$1"/>
>             <field  column="page_views"
> xpath="/ReportDataResponse/Data/Rows/Row/Value[@columnId='PAGE_VIEWS']"  />
>        </entity>
>  </document>
> </dataConfig>
>
>
> This little test perl script correctly extracts the data:
>
> use XML::XPath;
> use XML::XPath::XMLParser;
>
> my $xp = XML::XPath->new(filename => 'cm.xml');
> my $nodeset = $xp->find('/ReportDataResponse/Data/Rows/Row');
> foreach my $node ($nodeset->get_nodelist) {
>     my $page_name = $node->findvalue('Value[@columnId="PAGE_NAME"]');
>     my $page_views = $node->findvalue('Value[@columnId="PAGE_VIEWS"]');
>     $page_name =~ s/^PRODUCT:.*\((.*?)\)$/$1/;
> }
>
> From logs:
>
> INFO: Loading DIH Configuration: data-config.xml
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter
> loadDataConfig
> INFO: Data Configuration loaded successfully
> Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=2
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> INFO: Starting Full Import
> Aug 24, 2012 3:53:10 PM
> org.apache.solr.handler.dataimport.SimplePropertiesWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Aug 24, 2012 3:53:10 PM org.apache.solr.update.DirectUpdateHandler2
> deleteAll
> INFO: [ssww] REMOVING ALL DOCUMENTS FROM INDEX
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.URLDataSource
> getData
> FINE: Accessing URL:
> https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=*****&username=***&format=XML&userAuthKey=******&language=en_US&viewID=9475540&period_a=M20110930
> Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:12 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=1
> Aug 24, 2012 3:53:14 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=1
> Aug 24, 2012 3:53:16 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:18 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:20 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:22 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:24 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:27 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=0
> Aug 24, 2012 3:53:28 PM org.apache.solr.handler.dataimport.DocBuilder finish
> INFO: Import completed successfully
> Aug 24, 2012 3:53:28 PM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start
> commit{flags=0,_version_=0,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}
> Aug 24, 2012 3:53:28 PM org.apache.solr.core.SolrDeletionPolicy onCommit
> INFO: SolrDeletionPolicy.onCommit: commits:num=2
>
> commit{dir=/var/lib/tomcat6/solr/apache-solr-4.0.0-BETA/core1/data/index,segFN=segments_2b,generation=83,filenames=[segments_2b]
>
> commit{dir=/var/lib/tomcat6/solr/apache-solr-4.0.0-BETA/core1/data/index,segFN=segments_2c,generation=84,filenames=[segments_2c]
> Aug 24, 2012 3:53:28 PM org.apache.solr.core.SolrDeletionPolicy
> updateCommits
> INFO: newest commit = 84
> Aug 24, 2012 3:53:28 PM org.apache.solr.search.SolrIndexSearcher <init>
> INFO: Opening Searcher@ff33d42 main
> Aug 24, 2012 3:53:28 PM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: end_commit_flush
> Aug 24, 2012 3:53:28 PM org.apache.solr.core.QuerySenderListener newSearcher
> INFO: QuerySenderListener sending requests to Searcher@ff33d42
> main{StandardDirectoryReader(segments_2c:323)}
> Aug 24, 2012 3:53:28 PM org.apache.solr.core.QuerySenderListener newSearcher
> INFO: QuerySenderListener done.
> Aug 24, 2012 3:53:28 PM org.apache.solr.core.SolrCore registerSearcher
> INFO: [ssww] Registered new searcher Searcher@ff33d42
> main{StandardDirectoryReader(segments_2c:323)}
> Aug 24, 2012 3:53:28 PM
> org.apache.solr.handler.dataimport.SimplePropertiesWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Aug 24, 2012 3:53:28 PM
> org.apache.solr.handler.dataimport.SimplePropertiesWriter persist
> INFO: Wrote last indexed time to dataimport.properties
> Aug 24, 2012 3:53:28 PM org.apache.solr.handler.dataimport.DocBuilder
> execute
> INFO: Time taken = 0:0:17.918
> Aug 24, 2012 3:53:28 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: [ssww] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=2 {deleteByQuery=*:*,commit=} 0 2
>



-- 
Lance Norskog
goks...@gmail.com
