DIH HTMLStripTransformer
Why does stripHTML=false have no effect in DIH? The HTML is stripped in both text and text_nohtml when I display the index with select?q=*. I'm trying to get one field without HTML and one with it, so that I can also index the links on the page.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main">
  <!-- transformer="script:GenerateId" -->
  <field column="title" xpath="//title" />
  <field column="id" xpath="//id" />
  <field column="file" xpath="//file" />
  <field column="url" xpath="//url" />
  <field column="urlParse" xpath="//urlParse" />
  <field column="last_modified" xpath="//last_modified" />
  <field column="Author" xpath="//author" />
  <entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="HTMLStripTransformer">
    <field column="text" name="text" stripHTML="false" />
    <field column="text" name="text_nohtml" stripHTML="true" />
    <!-- transformer="RegexTransformer"
    <field column="text_html_b" regex="(?s)^.*&lt;div.*id=.*&gt;(.*)&lt;/div&gt;.*$" replaceWith="$1" sourceColName="text" />
    <field column="text_html_b" regex="(?s)^.*&lt;!-body-&gt;(.*)&lt;!-/body-&gt;.*$" replaceWith="$1" sourceColName="text" />
    -->
  </entity>
</entity>
Re: DIH: HTMLStripTransformer in sub-entities?
That's exactly what turned out to be the problem. We thought we had already tried that permutation but apparently hadn't. I know it's obvious in retrospect. Thanks for the suggestion.

Thanks,
Andy Pickler

On Wed, Jul 3, 2013 at 2:38 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
> shouldn't it be column=replyContent since you are renaming it in SELECT?
Re: DIH: HTMLStripTransformer in sub-entities?
On Tue, Jul 2, 2013 at 10:59 AM, Andy Pickler andy.pick...@gmail.com wrote:
> SELECT br.other_content AS replyContent FROM block_reply
> <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*

Shouldn't it be column="replyContent", since you are renaming it in the SELECT?

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
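A minimal sketch of the suggestion above, based only on the reduced example from the original question and not on the real schema: the field's column attribute would use the alias that the SELECT returns, because that is the name under which the value appears in the row the transformer sees. The "br" table alias after block_reply is an assumption added so that the query is valid SQL.

```xml
<entity name="blockReplies" dataSource="database"
        transformer="HTMLStripTransformer"
        query="SELECT br.other_content AS replyContent FROM block_reply br">
  <!-- column names the alias from the SELECT, not the database column -->
  <field column="replyContent" stripHTML="true" />
</entity>
```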
DIH: HTMLStripTransformer in sub-entities?
Solr 4.1.0

We've been using the DIH to pull data in from a MySQL database for quite some time now. We're now wanting to strip all the HTML content out of many fields using the HTMLStripTransformer (http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer). Unfortunately, while it seems to be working fine for top-level entities, we can't seem to get it to work for sub-entities:

(not exact schema, reduced for example purposes)

<entity name="blocks" dataSource="database" transformer="HTMLStripTransformer" query="SELECT id as blockId, name as blockTitle, content as content FROM engagement_block">
  <field column="content" stripHTML="true" /> *THIS WORKS!*
  <entity name="blockReplies" dataSource="database" transformer="HTMLStripTransformer" query="SELECT br.other_content AS replyContent FROM block_reply">
    <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
  </entity>
</entity>

We've tried several different permutations of putting the sub-entity column in different nest levels of the XML, to no avail. I'm curious whether we're trying something that is just not supported or whether we are just trying the wrong things.

Thanks,
Andy Pickler
Re: DIH: HTMLStripTransformer in sub-entities?
On 2 July 2013 20:29, Andy Pickler andy.pick...@gmail.com wrote:
> [...]
> (not exact schema, reduced for example purposes)

Please do not do that. This DIH configuration file does not make sense (please see comments below), and we are left guessing in the dark. If the file is too large, you can share it on something like pastebin.com

> <entity name="blockReplies" dataSource="database" transformer="HTMLStripTransformer" query="SELECT br.other_content AS replyContent FROM block_reply">
> <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
> [...]

(a) You SELECT replyContent, but the column attribute in the field is named other_content. Nothing should be getting indexed into the field.
(b) Why are your entities nested if the inner entity has no relationship to the outer one?

Regards,
Gora
Re: DIH: HTMLStripTransformer in sub-entities?
Thanks for the quick reply. Unfortunately, I don't believe my company would want me sharing our exact production schema in a public forum, although I realize it makes it harder to diagnose the problem. The sub-entity is a multi-valued field that indeed does have a relationship to the outer entity. I just left off the 'where' clause from the sub-entity, as I didn't believe it was helpful in the context of this problem. We use the convention of

SELECT dbColumnName AS solrFieldName

so that we can relate the database column name to what we want it to be named in the Solr index. I don't think any of this helps you identify my problem, but I tried to address your questions.

Thanks,
Andy

On Tue, Jul 2, 2013 at 9:14 AM, Gora Mohanty g...@mimirtech.com wrote:
> [...]
> (a) You SELECT replyContent, but the column attribute in the field is named other_content. Nothing should be getting indexed into the field.
> (b) Why are your entities nested if the inner entity has no relationship to the outer one?
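For readers following along, the kind of parent/child relationship described in this message might look roughly like the sketch below. It is an illustration under stated assumptions, not the poster's actual configuration: the join column br.block_id and the "br" table alias are hypothetical, since the real WHERE clause was never posted. The ${blocks.blockId} placeholder is DIH's syntax for referring to a column of the current parent row, and it uses the aliased name in line with the "dbColumnName AS solrFieldName" convention.

```xml
<entity name="blocks" dataSource="database"
        query="SELECT id AS blockId, name AS blockTitle FROM engagement_block">
  <!-- child query runs once per parent row; block_id is a hypothetical join column -->
  <entity name="blockReplies" dataSource="database"
          transformer="HTMLStripTransformer"
          query="SELECT br.other_content AS replyContent
                 FROM block_reply br
                 WHERE br.block_id = '${blocks.blockId}'">
    <!-- the field refers to the alias produced by the SELECT -->
    <field column="replyContent" stripHTML="true" />
  </entity>
</entity>
```

With this convention, every column attribute in the config names a Solr-facing alias, so the database column names appear only inside the SQL.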
Re: DIH: HTMLStripTransformer in sub-entities?
On 2 July 2013 20:55, Andy Pickler andy.pick...@gmail.com wrote:
> Thanks for the quick reply. Unfortunately, I don't believe my company would want me sharing our exact production schema in a public forum, although I realize it makes it harder to diagnose the problem.
> [...]
> I don't think any of this helps you identify my problem, but I tried to address your questions.

Um, with all due respect, I do not then know how to address your issues in a public forum. Maybe you are then better off hiring someone to handle your specific problems, after signing an NDA or whatever it takes from your side. Please see http://wiki.apache.org/solr/Support

Regards,
Gora
HTML entities being missed by DIH HTMLStripTransformer
Hi,

I am using DIH to index some database fields. These fields contain HTML-formatted text. I use the HTMLStripTransformer to remove that markup. This works fine when the text is, for example:

<li>Item One</li> or <b>This is in Bold</b>

However, when the text has HTML entity names, as in:

&lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;

NOTHING HAPPENS. Two questions.

(1) Is this the expected behavior of the DIH HTMLStripTransformer?
(2) If yes, is there another transformer that I can employ first to turn these HTML entities into their usual symbols, which can then be removed by the DIH HTMLStripTransformer?

Thanks
- ashok

-- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: HTML entities being missed by DIH HTMLStripTransformer
On 4 April 2013 00:30, Ashok ash...@qualcomm.com wrote:
> [...]
> Two questions. (1) Is this the expected behavior of DIH HTMLStripTransformer?

Yes, I believe so.

> (2) If yes, is there another transformer that I can employ first to turn these html entities into their usual symbols that can then be removed by the DIH HTMLStripTransformer?

How are the HTML tags getting converted into entities? Are you escaping input somewhere?

Regards,
Gora
Re: HTML entities being missed by DIH HTMLStripTransformer
Well, the database field has text, sometimes with HTML entities and at other times with HTML tags. I have no control over the process that populates the database tables with info.
Re: HTML entities being missed by DIH HTMLStripTransformer
Then, I would say, you have a bigger problem. However, you can probably run a RegEx filter and replace those known escapes with real characters before you run your HTMLStrip filter. Or run HTMLStrip, RegEx, and HTMLStrip again.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch

On Wed, Apr 3, 2013 at 3:19 PM, Ashok ash...@qualcomm.com wrote:
> Well, the database field has text, sometimes with HTML entities and at other times with html tags. I have no control over the process that populates the database tables with info.
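One way the "RegEx first, then HTMLStrip" idea might be written in a DIH config is sketched below. Everything schema-related here is hypothetical (the entity name item, the table, the columns raw_body and body_lt, and the target field body), and the sketch only handles the two escapes &lt; and &gt;. It assumes that RegexTransformer processes the field elements in the order listed and that the chained HTMLStripTransformer then runs on the resulting body column. Note that inside an XML attribute the regex has to be written as &amp;lt; to match the literal text &lt; in the data.

```xml
<entity name="item" dataSource="database"
        transformer="RegexTransformer,HTMLStripTransformer"
        query="SELECT id, body AS raw_body FROM item">
  <!-- step 1 (RegexTransformer): literal "&lt;" in the data becomes "<" -->
  <field column="body_lt" sourceColName="raw_body"
         regex="&amp;lt;" replaceWith="&lt;" />
  <!-- step 2 (RegexTransformer): literal "&gt;" becomes ">";
       then HTMLStripTransformer strips the now-real tags from "body" -->
  <field column="body" sourceColName="body_lt"
         regex="&amp;gt;" replaceWith="&gt;" stripHTML="true" />
</entity>
```

The intermediate columns exist only to carry the value between steps; whether they end up in the index depends on whether the schema defines fields with those names.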
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Ashok,

HTMLStripTransformer uses HTMLStripCharFilter under the hood, and HTMLStripCharFilter converts all HTML entities to their corresponding characters.

What version of Solr are you using?

My guess is that it only appears that nothing is happening, since when they are presented in a browser, they show up as the characters the entities represent.

I think (never done this myself) that if you apply the HTMLStripTransformer twice, it will first convert the entities to characters, and then on the second pass, remove the HTML constructs.

From http://wiki.apache.org/solr/DataImportHandler#Transformer:

> The entity transformer attribute can consist of a comma separated list of transformers (say transformer="foo.X,foo.Y"). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns are processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformers attributes are present. If so the transformer does it's thing! When all of the listed entity columns have been scanned the process is repeated using the next transformer in the list.

Steve

On Apr 3, 2013, at 3:30 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
> However, you can probably run RegEx filter and replace those known escapes with real characters before you run your HTMLStrip filter. Or run, HTMLStrip, RegEx and HTMLStrip again.
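A rough sketch of what "applying the HTMLStripTransformer twice" could look like, using the comma-separated chaining described in the wiki passage quoted above. The entity, table, and column names are hypothetical, and listing the same transformer twice is an assumption about how the double pass would be configured; the thread does not show the exact config that was used.

```xml
<entity name="item" dataSource="database"
        transformer="HTMLStripTransformer,HTMLStripTransformer"
        query="SELECT id, description FROM item">
  <!-- pass 1: &lt;li&gt;Item One&lt;/li&gt; is decoded to <li>Item One</li>
       pass 2: the decoded tags are stripped -->
  <field column="description" stripHTML="true" />
</entity>
```

Because each transformer in the chain scans the fields for its own attributes, a single stripHTML="true" on the field is picked up by both passes.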
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Steve,

Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1.

Thank you very much!
- ashok
Re: HTML entities being missed by DIH HTMLStripTransformer
Cool, glad I was able to help.

On Apr 3, 2013, at 4:18 PM, Ashok ash...@qualcomm.com wrote:
> Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok