On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss
> feed
> which contains N links to other rss feeds. The nested loop then reads each
> one of those to create documents. (Yes, this is an obnoxious thing to do.)
> Let's say the outer RSS feed gives 10 items. Both feeds use the same
> structure: /rss/channel with a <title> node and then N <item> nodes inside
> the channel. This should create two separate XML streams with two separate
> Xpath iterators, right?
>
> <entity name="outer" http stuff>
>    <field column="name" xpath="/rss/channel/title" />
>    <field column="url" xpath="/rss/channel/item/link"/>
>
>    <entity name="inner" http stuff url="${outer.url}" pk="title" >
>        <field column="title" xpath="/rss/channel/item/title" />
>    </entity>
> </entity>
>
> This does indeed walk each url from the outer feed and then fetch the inner
> rss feed. Bravo!
>
> However, I found two separate problems in XPath iteration. They may be
> each "inner" feed. Each feed has several documents with different title
> fields but it only grabs the first.
>

The idea behind nested entities is to join them together: one Solr document
is created for each root entity row, and the child entities provide
additional fields that are added to that parent document.

I guess you want to create separate Solr documents from the root entity as
well as the child entities. I don't think that is possible with nested
entities. Essentially, you are trying to crawl feeds, not join them.

An integration with Apache Droids might be worth considering:
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level deep, there may be a workaround.
However, it may be easier to implement all of this in your own Java program
and post the results to Solr as usual; see the sketch below.
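For the one-level case, here is a rough, untested sketch of that approach
using SolrJ and the JDK's built-in XPath support. The Solr URL, the outer
feed URL argument, and the "name"/"title" field names are my assumptions
based on your config, so adjust them to match your schema:

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class RssCrawler {

    public static void main(String[] args) throws Exception {
        // Assumed Solr location -- change to wherever your instance runs.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Outer feed (passed as the first argument): collect the inner feed links.
        Document outer = parse(args[0]);
        NodeList links = (NodeList) xpath.evaluate(
                "/rss/channel/item/link", outer, XPathConstants.NODESET);

        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < links.getLength(); i++) {
            Document inner = parse(links.item(i).getTextContent());
            String name = xpath.evaluate("/rss/channel/title", inner);
            NodeList titles = (NodeList) xpath.evaluate(
                    "/rss/channel/item/title", inner, XPathConstants.NODESET);
            // One Solr document per inner <item> -- which is exactly what
            // the nested-entity join will not give you.
            for (int j = 0; j < titles.getLength(); j++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("name", name);
                doc.addField("title", titles.item(j).getTextContent());
                docs.add(doc);
            }
        }
        server.add(docs);
        server.commit();
    }

    private static Document parse(String url) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(url);
    }
}

This skips error handling and politeness (timeouts, retries, robots rules),
but it reuses the XPath expressions from your DIH config and sidesteps the
join semantics entirely.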



> The other is an off-by-one bug. The outer loop iterates through the 10
> items and then tries to pull an 11th. It then gives this exception trace:
>
> INFO: Created URL to:  [inner url]
> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
> SEVERE: Exception thrown while getting data
> java.net.MalformedURLException: no protocol: null/account.rss
>        at java.net.URL.<init>(URL.java:567)
>        at java.net.URL.<init>(URL.java:464)
>        at java.net.URL.<init>(URL.java:413)
>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>  ...
> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>


-- 
Regards,
Shalin Shekhar Mangar.
