Another idea is to create the logic you need, dump it to a temp MySQL table, and then fetch the feeds. That has worked pretty nicely for me; it removes the need for the outer feed to do the work. At first I could not figure out if this was a bug or a feature ... Something like ...

<entity dataSource="db" name="db"
        query="SELECT id FROM table"
        processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
  <entity dataSource="feeds" name="feeds"
          url="http://${db.id}.somedomain.com/feed.xml"
          pk="link"
          processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"
          forEach="/rss/channel/item"
          transformer="org.apache.solr.handler.dataimport.TemplateTransformer,org.apache.solr.handler.dataimport.DateFormatTransformer">
    <field column="title" xpath="/rss/channel/item/title"/>
    <field column="link" xpath="/rss/channel/item/link"/>
    <field column="docid" template="DOC-${feeds.link}"/>
    <field column="doctype" template="video"/>
    <field column="description" xpath="/rss/channel/item/description"/>
    <field column="thumbnail" xpath="/rss/channel/item/enclosure/@url"/>
    <field column="pubdate" xpath="/rss/channel/item/pubDate"
           dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"/>
  </entity>
</entity>

- Jon

On Nov 1, 2008, at 3:26 PM, Norskog, Lance wrote:

The inner entity drills down and gets more detail about each item in the
outer loop. It creates one document.

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: Friday, October 31, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Http input bug - problem with two-level RSS walker

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]>
wrote:

I wrote a nested HttpDataSource RSS poller. The outer loop reads an
rss feed which contains N links to other rss feeds. The nested loop
then reads each one of those to create documents. (Yes, this is an
obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
Both feeds use the same
structure: /rss/channel with a <title> node and then N <item> nodes
inside the channel. This should create two separate XML streams with
two separate Xpath iterators, right?

<entity name="outer" http stuff>
  <field column="name" xpath="/rss/channel/title" />
  <field column="url" xpath="/rss/channel/item/link"/>

  <entity name="inner" http stuff url="${outer.url}" pk="title" >
      <field column="title" xpath="/rss/channel/item/title" />
  </entity>
</entity>

This does indeed walk each url from the outer feed and then fetch the
inner rss feed. Bravo!

However, I found two separate problems in xpath iteration. They may be
related. The first problem is that it only stores the first document
from each "inner" feed. Each feed has several documents with different
title fields, but it only grabs the first.


The idea behind nested entities is to join them together so that one
Solr document is created for each root entity and the child entities
provide more fields which are added to the parent document.
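A rough way to picture that join (this is only an illustration of the behavior described above, not DIH's actual code — all names are made up): each child row's fields are folded into the single parent document, so extra child rows never become documents of their own:

```java
import java.util.*;

public class NestedEntityJoinSketch {
    /** Fold child-entity rows into the one parent document (first value wins). */
    static Map<String, Object> join(Map<String, Object> parentRow,
                                    List<Map<String, Object>> childRows) {
        Map<String, Object> doc = new HashMap<>(parentRow);
        for (Map<String, Object> child : childRows) {
            child.forEach(doc::putIfAbsent); // later rows add nothing new
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> parent = Map.of("name", "Groups of stuff");
        List<Map<String, Object>> children = List.of(
                Map.of("title", "first item"), Map.of("title", "second item"));
        // Only ONE document comes out, and only one title survives in it.
        System.out.println(join(parent, children));
    }
}
```

That is why a nested entity that returns many rows still yields a single Solr document per outer row.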

I guess you want to create separate Solr documents from the root entity
as well as the child entities. I don't think that is possible with
nested entities. Essentially, you are trying to crawl feeds, not join
them.

An integration with Apache Droids might be worth considering.
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java
program and just post results to Solr as usual.
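For a one-level-deep crawl, such a program can be quite small. A minimal sketch (every name here is made up, and the fetch step is stubbed with an in-memory map so the example is self-contained; a real crawler would do an HTTP GET per URL and then post each resulting document to Solr as usual):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.*;

public class RssCrawlSketch {
    // Stub fetcher: a real crawler would issue an HTTP GET here.
    static String fetch(String url, Map<String, String> fakeWeb) {
        return fakeWeb.get(url);
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Walk the outer feed's item links, then collect one title per inner item. */
    static List<String> crawl(String outerUrl, Map<String, String> fakeWeb) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        List<String> titles = new ArrayList<>();
        Document outer = parse(fetch(outerUrl, fakeWeb));
        NodeList links = (NodeList) xp.evaluate("/rss/channel/item/link",
                outer, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            Document inner = parse(fetch(links.item(i).getTextContent(), fakeWeb));
            NodeList ts = (NodeList) xp.evaluate("/rss/channel/item/title",
                    inner, XPathConstants.NODESET);
            for (int j = 0; j < ts.getLength(); j++) {
                titles.add(ts.item(j).getTextContent()); // one Solr doc per inner item
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> web = new HashMap<>();
        web.put("outer", "<rss><channel><title>o</title>"
                + "<item><link>inner1</link></item></channel></rss>");
        web.put("inner1", "<rss><channel><title>i</title>"
                + "<item><title>A</title></item>"
                + "<item><title>B</title></item></channel></rss>");
        System.out.println(crawl("outer", web));
    }
}
```

Each collected title would become its own document, which is exactly what the nested-entity join cannot give you.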



The other is an off-by-one bug. The outer loop iterates through the 10
items and then tries to pull an 11th. It then gives this exception
trace:

INFO: Created URL to:  [inner url]
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
      at java.net.URL.<init>(URL.java:567)
      at java.net.URL.<init>(URL.java:464)
      at java.net.URL.<init>(URL.java:413)
      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
      at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
      at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
      at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
      at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
      at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: album document :
SolrInputDocument[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception in invoking url null Processing Document # 11
      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)








--
Regards,
Shalin Shekhar Mangar.
