Re: DIH Http input bug - problem with two-level RSS walker

Noble Paul നോബിള്‍ नोब्ळ् Sun, 02 Nov 2008 22:24:05 -0800

It may be fine to provide that, but what benefit would you get from it
that you can't already get by writing a simple DataSource in Java? Script
is just a convenience, right?
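
For comparison with the ScriptDataSource example quoted below, a Java
DataSource could be wired into the config in much the same way. This is
only a minimal sketch under stated assumptions: com.example.OuterLoopDataSource
is a hypothetical class name, the query value is a placeholder, and the class
is assumed to extend DataSource and return the same kind of row iterator
that JdbcDataSource does.

```xml
<dataConfig>
  <!-- Assumption: com.example.OuterLoopDataSource is a hypothetical class
       that extends org.apache.solr.handler.dataimport.DataSource -->
  <dataSource type="com.example.OuterLoopDataSource" name="outerloop" />
  <document>
    <!-- The query string (a placeholder here) is what the data source
         receives in getData(String query) -->
    <entity name="outer" dataSource="outerloop" query="placeholder-query">
      <field column="url" name="url" />
    </entity>
  </document>
</dataConfig>
```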


--Noble

On Mon, Nov 3, 2008 at 11:41 AM, Jon Baer <[EMAIL PROTECTED]> wrote:
> On a side note ... it would be nice if your data source could also be the
> result of a script (instead of trying to hack around it w/ JdbcDataSource)
> ...
>
> Something similar to what ScriptTransformer does ...
> (http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9)
>
> An example would be:
>
> <dataSource type="ScriptDataSource" name="outerloop" script="outerloop.js"
> />
>
> (The script would basically contain just a callback - getData(String query)
> that results in an array set or might set values on its children, etc.)
>
> - Jon
>
> On Nov 3, 2008, at 12:40 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> Hi Lance,
>> I think I understand your problem.
>> You wish to create docs for both entities (as suggested by Jon
>> Baer), so the best solution would be to create two root entities. The
>> first one should be the outer entity, with a transformer that stores all
>> the URLs in the db. The JdbcDataSource can do inserts/updates too (the
>> method is the same, getData()). The second entity can read from the db
>> and create docs (see Jon Baer's suggestion) using the
>> XPathEntityProcessor as a sub-entity.
>> --Noble
>>
>> On Mon, Nov 3, 2008 at 9:44 AM, Noble Paul നോബിള്‍ नोब्ळ्
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi Lance,
>>> Do a full import without debug and let us know if my suggestion
>>> (rootEntity="false") worked. If it didn't, I can suggest something else
>>> (writing a Transformer).
>>>
>>>
>>> On Sun, Nov 2, 2008 at 8:13 AM, Noble Paul നോബിള്‍ नोब्ळ्
>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>> If you wish to create one doc per inner entity, then set
>>>> rootEntity="false" for the entity outer.
>>>> The exception is thrown because the url is wrong.
>>>>
>>>> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an RSS
>>>>> feed which contains N links to other RSS feeds. The nested loop then
>>>>> reads each one of those to create documents. (Yes, this is an obnoxious
>>>>> thing to do.) Let's say the outer RSS feed gives 10 items. Both feeds use
>>>>> the same structure: /rss/channel with a <title> node and then N <item>
>>>>> nodes inside the channel. This should create two separate XML streams
>>>>> with two separate XPath iterators, right?
>>>>>
>>>>> <entity name="outer" http stuff>
>>>>>  <field column="name" xpath="/rss/channel/title" />
>>>>>  <field column="url" xpath="/rss/channel/item/link"/>
>>>>>
>>>>>  <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>>>>      <field column="title" xpath="/rss/channel/item/title" />
>>>>>  </entity>
>>>>> </entity>
>>>>>
>>>>> This does indeed walk each url from the outer feed and then fetch the
>>>>> inner RSS feed. Bravo!
>>>>>
>>>>> However, I found two separate problems in XPath iteration. They may be
>>>>> related. The first problem is that it only stores the first document
>>>>> from each "inner" feed. Each feed has several documents with different
>>>>> title fields but it only grabs the first.
>>>>>
>>>>> The other is an off-by-one bug. The outer loop iterates through the 10
>>>>> items and then tries to pull an 11th. It then gives this exception trace:
>>>>>
>>>>> INFO: Created URL to:  [inner url]
>>>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>>>>> SEVERE: Exception thrown while getting data
>>>>> java.net.MalformedURLException: no protocol: null/account.rss
>>>>>      at java.net.URL.<init>(URL.java:567)
>>>>>      at java.net.URL.<init>(URL.java:464)
>>>>>      at java.net.URL.<init>(URL.java:413)
>>>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>>>      at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>>>>      at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>>>> ...
>>>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>>>>> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>>>>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --Noble Paul
>>>>
>>>
>>>
>>>
>>> --
>>> --Noble Paul
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul
