Re: More debugging DIH - URLDataSource (solved)
Thank you for these suggestions. The real problem was incorrect syntax for the primary key column in data-config.xml. Once I corrected that, the data loaded fine.

Wrong:

Right:

On 08/25/2012 08:52 PM, Lance Norskog wrote:

About XPaths: the XPath engine does a limited range of xpaths. The doc says that your paths are covered.

About logs: You only have the RegexTransformer listed. You need to add LogTransformer to the transformer list: http://wiki.apache.org/solr/DataImportHandler#LogTransformer

Having xml entity codes in the url string seems right. Can you verify the url that goes to the remote site? Can you read the logs at the remote site? Can you run this code through a proxy and watch the data?

On Fri, Aug 24, 2012 at 1:34 PM, Carrie Coy wrote:

I'm trying to write a DIH to incorporate page view metrics from an XML feed into our index. The DIH makes a single request, and updates 0 documents. I set log level to "finest" for the entire dataimport section, but I still can't tell what's wrong. I suspect the XPath. http://localhost:8080/solr/core1/admin/dataimport.jsp?handler=/dataimport returns 404. Any suggestions on how I can debug this?
Re: More debugging DIH - URLDataSource
About XPaths: the XPath engine does a limited range of xpaths. The doc says that your paths are covered.

About logs: You only have the RegexTransformer listed. You need to add LogTransformer to the transformer list:
http://wiki.apache.org/solr/DataImportHandler#LogTransformer

Having xml entity codes in the url string seems right. Can you verify the url that goes to the remote site? Can you read the logs at the remote site? Can you run this code through a proxy and watch the data?

On Fri, Aug 24, 2012 at 1:34 PM, Carrie Coy wrote:
> I'm trying to write a DIH to incorporate page view metrics from an XML feed
> into our index. The DIH makes a single request, and updates 0 documents.
> I set log level to "finest" for the entire dataimport section, but I still
> can't tell what's wrong. I suspect the XPath.
> http://localhost:8080/solr/core1/admin/dataimport.jsp?handler=/dataimport
> returns 404. Any suggestions on how I can debug this?
>
> solr-spec
> 4.0.0.2012.08.06.22.50.47
>
> The XML data:
>
> PRODUCT: BURLAP POTATO SACKS (PACK OF 12) (W4537)
> 2388
>
> PRODUCT: OPAQUE PONY BEADS 6X9MM (BAG OF 850) (BE9000)
> 1313
>
> My DIH:
>
> type="URLDataSource"
> encoding="UTF-8"
> connectionTimeout="5000"
> readTimeout="1"/>
>
> dataSource="coremetrics"
> pk="id"
> url="https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=**&username=&format=XML&userAuthKey=&language=en_US&viewID=9475540&period_a=M20110930";
> processor="XPathEntityProcessor"
> stream="true"
> forEach="/ReportDataResponse/Data/Rows/Row"
> logLevel="fine"
> transformer="RegexTransformer" >
>
> xpath="/ReportDataResponse/Data/Rows/Row/Value[@columnId='PAGE_NAME']"
> regex="/^PRODUCT:.*\((.*?)\)$/" replaceWith="$1"/>
> xpath="/ReportDataResponse/Data/Rows/Row/Value[@columnId='PAGE_VIEWS']" />
>
> This little test perl script correctly extracts the data:
>
> use XML::XPath;
> use XML::XPath::XMLParser;
>
> my $xp = XML::XPath->new(filename => 'cm.xml');
> my $nodeset = $xp->find('/ReportDataResponse/Data/Rows/Row');
> foreach my $node ($nodeset->get_nodelist) {
>     my $page_name = $node->findvalue('Value[@columnId="PAGE_NAME"]');
>     my $page_views = $node->findvalue('Value[@columnId="PAGE_VIEWS"]');
>     $page_name =~ s/^PRODUCT:.*\((.*?)\)$/$1/;
> }
>
> From logs:
>
> INFO: Loading DIH Configuration: data-config.xml
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
> INFO: Data Configuration loaded successfully
> Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=2
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
> INFO: Starting Full Import
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
> INFO: Read dataimport.properties
> Aug 24, 2012 3:53:10 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> INFO: [ssww] REMOVING ALL DOCUMENTS FROM INDEX
> Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.URLDataSource getData
> FINE: Accessing URL: https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=*&username=***&format=XML&userAuthKey=**&language=en_US&viewID=9475540&period_a=M20110930
> Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:12 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=1
> Aug 24, 2012 3:53:14 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=1
> Aug 24, 2012 3:53:16 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:18 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:20 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:22 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:24 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
> Aug 24, 2012 3:53:27 PM org.apache.solr.core.SolrCore execute
> INFO: [ssww] webapp=/sol
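
One way to act on the "verify the url" suggestion before involving a proxy is to decode the url attribute the way an XML parser would and list the query parameters that result. The following is only a sketch in Python, not part of DIH: the helper name query_params is hypothetical, and the attribute value uses placeholder credentials (CLIENT, USER, KEY) in place of the ones redacted in the thread.

```python
from urllib.parse import parse_qs, urlsplit
from xml.sax.saxutils import unescape


def query_params(url_attribute):
    """Decode XML entities (&amp; etc.) in an attribute value, then parse its query string."""
    decoded = unescape(url_attribute)
    return parse_qs(urlsplit(decoded).query, keep_blank_values=True)


# Hypothetical attribute value: host, path and parameter names come from the
# thread; CLIENT, USER and KEY are placeholders for the redacted values.
ATTR = (
    "https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/"
    "contentcategory/bypage.ftl?clientId=CLIENT&amp;username=USER&amp;format=XML"
    "&amp;userAuthKey=KEY&amp;language=en_US&amp;viewID=9475540&amp;period_a=M20110930"
)

params = query_params(ATTR)
for name in sorted(params):
    print(name, params[name])
```

If a parameter name comes out fused with its neighbour, or with a stray character in front, the entity encoding in the config is the first thing to check.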
More debugging DIH - URLDataSource
I'm trying to write a DIH to incorporate page view metrics from an XML feed into our index. The DIH makes a single request, and updates 0 documents. I set log level to "finest" for the entire dataimport section, but I still can't tell what's wrong. I suspect the XPath. http://localhost:8080/solr/core1/admin/dataimport.jsp?handler=/dataimport returns 404. Any suggestions on how I can debug this?

solr-spec 4.0.0.2012.08.06.22.50.47

The XML data:

PRODUCT: BURLAP POTATO SACKS (PACK OF 12) (W4537)
2388

PRODUCT: OPAQUE PONY BEADS 6X9MM (BAG OF 850) (BE9000)
1313

My DIH:

https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=**&username=&format=XML&userAuthKey=&language=en_US&viewID=9475540&period_a=M20110930";
processor="XPathEntityProcessor"
stream="true"
forEach="/ReportDataResponse/Data/Rows/Row"
logLevel="fine"
transformer="RegexTransformer" >

This little test perl script correctly extracts the data:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'cm.xml');
my $nodeset = $xp->find('/ReportDataResponse/Data/Rows/Row');
foreach my $node ($nodeset->get_nodelist) {
    my $page_name = $node->findvalue('Value[@columnId="PAGE_NAME"]');
    my $page_views = $node->findvalue('Value[@columnId="PAGE_VIEWS"]');
    $page_name =~ s/^PRODUCT:.*\((.*?)\)$/$1/;
}

From logs:

INFO: Loading DIH Configuration: data-config.xml
Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded successfully
Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=2
Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
Aug 24, 2012 3:53:10 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [ssww] REMOVING ALL DOCUMENTS FROM INDEX
Aug 24, 2012 3:53:10 PM org.apache.solr.handler.dataimport.URLDataSource getData
FINE: Accessing URL: https://welcome.coremetrics.com/analyticswebapp/api/1.0/report-data/contentcategory/bypage.ftl?clientId=*&username=***&format=XML&userAuthKey=**&language=en_US&viewID=9475540&period_a=M20110930
Aug 24, 2012 3:53:10 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:12 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=1
Aug 24, 2012 3:53:14 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=1
Aug 24, 2012 3:53:16 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:18 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:20 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:22 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:24 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:27 PM org.apache.solr.core.SolrCore execute
INFO: [ssww] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0
Aug 24, 2012 3:53:28 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Aug 24, 2012 3:53:28 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{flags=0,_version_=0,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}
Aug 24, 2012 3:53:28 PM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=/var/lib/tomcat6/solr/apache-solr-4.0.0-BETA/core1/data/index,segFN=segments_2b,generation=83,filenames=[segments_2b]
commit{dir=/var/lib/tomcat6/solr/apache-solr-4.0.0-BETA/core1/data/index,segFN=segments_2c,generation=84,filenames=[segments_2c]
Aug 24, 2012 3:53:28 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 84
Aug 24, 2012 3:53:28 PM org.apache.solr.search.SolrIndexSearcher
INFO: Opening Searcher@ff33d42 main
Aug 24, 2012 3:53:28 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Aug 24, 2012 3:53:28 PM org.apache.solr.core.
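
For readers who want to repeat the stand-alone XPath check from the message without Perl, the same extraction can be sketched in Python. This is an illustration under assumptions: the sample document below is hypothetical, shaped only after the XPaths and values quoted in the message (the archived feed lost its tags), and extract_rows is a made-up helper name. It shows that the paths and the regex behave as intended outside Solr, which helps narrow the problem to the DIH configuration itself.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: element names follow the XPaths in the message; the
# real feed may carry extra elements or attributes not visible in the thread.
SAMPLE = """<ReportDataResponse>
  <Data>
    <Rows>
      <Row>
        <Value columnId="PAGE_NAME">PRODUCT: BURLAP POTATO SACKS (PACK OF 12) (W4537)</Value>
        <Value columnId="PAGE_VIEWS">2388</Value>
      </Row>
      <Row>
        <Value columnId="PAGE_NAME">PRODUCT: OPAQUE PONY BEADS 6X9MM (BAG OF 850) (BE9000)</Value>
        <Value columnId="PAGE_VIEWS">1313</Value>
      </Row>
    </Rows>
  </Data>
</ReportDataResponse>"""

# Same pattern as the Perl substitution: keep only the last parenthesised code.
PRODUCT_ID = re.compile(r"^PRODUCT:.*\((.*?)\)$")


def extract_rows(xml_text):
    """Return (product_id, page_views) pairs, one per Row element."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall("./Data/Rows/Row"):
        name = row.findtext("Value[@columnId='PAGE_NAME']", default="")
        views = row.findtext("Value[@columnId='PAGE_VIEWS']", default="")
        rows.append((PRODUCT_ID.sub(r"\1", name), views))
    return rows


for product_id, views in extract_rows(SAMPLE):
    print(product_id, views)
```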