dih HTMLStripTransformer

2013-09-24 Thread Andreas Owen
Why does stripHTML="false" have no effect in DIH? The HTML is stripped in both
text and text_nohtml when I display the index with select?q=*

I'm trying to get one field without HTML and one with it, so I can also index
the links on the page.
data-config.xml:

<entity name="rec" processor="XPathEntityProcessor"
        url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
        forEach="/docs/doc" dataSource="main"> <!-- transformer="script:GenerateId"-->
  <field column="title" xpath="//title" />
  <field column="id" xpath="//id" />
  <field column="file" xpath="//file" />
  <field column="url" xpath="//url" />
  <field column="urlParse" xpath="//urlParse" />
  <field column="last_modified" xpath="//last_modified" />
  <field column="Author" xpath="//author" />

  <entity name="tika" processor="TikaEntityProcessor"
          url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity"
          format="html" transformer="HTMLStripTransformer">
    <field column="text" name="text" stripHTML="false" />
    <field column="text" name="text_nohtml" stripHTML="true" />
    <!--  transformer="RegexTransformer"
    <field column="text_html_b"
           regex="(?s)^.*&lt;div.*id=.*&gt;(.*)&lt;/div&gt;.*$" replaceWith="$1"
           sourceColName="text"  />
    <field column="text_html_b"
           regex="(?s)^.*&lt;!-body-&gt;(.*)&lt;!-/body-&gt;.*$" replaceWith="$1"
           sourceColName="text"  /> -->
  </entity>
</entity>
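
No answer to this question is recorded in the thread. One plausible reading, offered as an assumption rather than a confirmed diagnosis: HTMLStripTransformer rewrites the row's column value in place, so when two field declarations share column="text", the one with stripHTML="true" overwrites the only copy of that column and both Solr fields receive stripped text. A minimal sketch of a workaround is to copy the column first with TemplateTransformer and strip only the copy. The transformer order and the use of ${tika.text} are illustrative and have not been tested against this setup.

```xml
<!-- Sketch only: assumes TemplateTransformer runs before HTMLStripTransformer
     and that ${tika.text} resolves to the current row's "text" column. -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.urlParse}" dataSource="dataUrl" onError="skip"
        htmlMapper="identity" format="html"
        transformer="TemplateTransformer,HTMLStripTransformer">
  <!-- Left untouched, so the markup (and the links) are kept. -->
  <field column="text" />
  <!-- A separate column built from "text"; only this copy is stripped. -->
  <field column="text_nohtml" template="${tika.text}" stripHTML="true" />
</entity>
```

The point of the design is that each Solr field gets its own row column, so stripping one cannot affect the other.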

Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-06 Thread Andy Pickler
That's exactly what turned out to be the problem.  We thought we had
already tried that permutation but apparently hadn't.  I know it's obvious
in retrospect.  Thanks for the suggestion.

Thanks,
Andy Pickler

On Wed, Jul 3, 2013 at 2:38 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

> On Tue, Jul 2, 2013 at 10:59 AM, Andy Pickler andy.pick...@gmail.com
> wrote:
>
> > SELECT
> >   br.other_content AS replyContent
> > FROM block_reply
> > ">
> > <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
>
> shouldn't it be
> column="replyContent"
> since you are renaming it in SELECT?
>
> Regards,
>    Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-03 Thread Alexandre Rafalovitch
On Tue, Jul 2, 2013 at 10:59 AM, Andy Pickler andy.pick...@gmail.com wrote:

> SELECT
>   br.other_content AS replyContent
> FROM block_reply
> ">
> <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*


Shouldn't it be
column="replyContent"
since you are renaming it in SELECT?
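
The suggestion above can be sketched as follows: the column attribute names the column as it comes back from the query, which here is the SELECT alias. The alias br on the table is an assumption added so the query is valid; the reduced query in the post omits it, along with its WHERE clause.

```xml
<!-- Sketch only: "block_reply br" adds a table alias the original post left out. -->
<entity name="blockReplies" dataSource="database"
        transformer="HTMLStripTransformer"
        query="SELECT br.other_content AS replyContent
               FROM block_reply br">
  <!-- column refers to the alias from the SELECT, not the database column name. -->
  <field column="replyContent" stripHTML="true" />
</entity>
```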

Regards,
   Alex.



Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Solr 4.1.0

We've been using the DIH to pull data in from a MySQL database for quite
some time now.  We're now wanting to strip all the HTML content out of many
fields using the HTMLStripTransformer (
http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
 Unfortunately, while it seems to be working fine for top-level entities,
we can't seem to get it to work for sub-entities:

(not exact schema, reduced for example purposes)

<entity name="blocks" dataSource="database"
        transformer="HTMLStripTransformer" query="
  SELECT
    id as blockId,
    name as blockTitle,
    content as content
  FROM engagement_block
  ">
  <field column="content" stripHTML="true" />  *THIS WORKS!*
  <entity name="blockReplies" dataSource="database"
          transformer="HTMLStripTransformer" query="
    SELECT
      br.other_content AS replyContent
    FROM block_reply
    ">
    <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
  </entity>
</entity>

We've tried several different permutations of putting the sub-entity column
in different nest levels of the XML to no avail.  I'm curious if we're
trying something that is just not supported or whether we are just trying
the wrong things.

Thanks,
Andy Pickler


Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Gora Mohanty
On 2 July 2013 20:29, Andy Pickler andy.pick...@gmail.com wrote:
> Solr 4.1.0
>
> We've been using the DIH to pull data in from a MySQL database for quite
> some time now.  We're now wanting to strip all the HTML content out of many
> fields using the HTMLStripTransformer (
> http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
> Unfortunately, while it seems to be working fine for top-level entities,
> we can't seem to get it to work for sub-entities:
>
> (not exact schema, reduced for example purposes)

Please do not do that. This DIH configuration file does
not make sense (please see comments below), and we
are left guessing in the dark. If the file is too large,
you can share it on something like pastebin.com

> <entity name="blocks" dataSource="database"
>         transformer="HTMLStripTransformer" query="
>   SELECT
>     id as blockId,
>     name as blockTitle,
>     content as content
>   FROM engagement_block
>   ">
>   <field column="content" stripHTML="true" />  *THIS WORKS!*
>   <entity name="blockReplies" dataSource="database"
>           transformer="HTMLStripTransformer" query="
>     SELECT
>       br.other_content AS replyContent
>     FROM block_reply
>     ">
>     <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
[...]

(a) You SELECT replyContent, but the column attribute
 in the field is named other_content. Nothing should
 be getting indexed into the field.
(b) Why are your entities nested if the inner entity has no
 relationship to the outer one?
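
Both points can be illustrated with a sketch of a nested entity whose query is tied to the parent row and whose field column matches the SELECT alias. The foreign-key column block_id and the table alias br are hypothetical; the post does not show the real WHERE clause.

```xml
<entity name="blocks" dataSource="database"
        transformer="HTMLStripTransformer"
        query="SELECT id AS blockId, name AS blockTitle, content AS content
               FROM engagement_block">
  <field column="content" stripHTML="true" />
  <!-- block_id is a hypothetical foreign-key column; the post omits the real WHERE clause. -->
  <entity name="blockReplies" dataSource="database"
          transformer="HTMLStripTransformer"
          query="SELECT br.other_content AS replyContent
                 FROM block_reply br
                 WHERE br.block_id = '${blocks.blockId}'">
    <!-- column matches the alias produced by the SELECT, per point (a). -->
    <field column="replyContent" stripHTML="true" />
  </entity>
</entity>
```

Here ${blocks.blockId} refers to the blockId value of the current row of the outer entity, which is what relates each reply to its block.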

Regards,
Gora


Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Thanks for the quick reply.  Unfortunately, I don't believe my company
would want me sharing our exact production schema in a public forum,
although I realize it makes it harder to diagnose the problem.  The
sub-entity is a multi-valued field that indeed does have a relationship to
the outer entity.  I just left off the 'where' clause from the sub-entity,
as I didn't believe it was helpful in the context of this problem.  We use
the convention of...

SELECT dbColumnName AS solrFieldName

...so that we can relate the database column name to what we want it to be
named in the Solr index.
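
For comparison, DIH field declarations can also carry the renaming themselves: column names the column as returned by the query and name gives the Solr field it is written to. The following is only a sketch of that alternative to the SQL-alias convention, with the table alias br assumed; it is not something proposed in this thread.

```xml
<!-- Sketch only: no SQL alias, so the returned column is still "other_content". -->
<entity name="blockReplies" dataSource="database"
        transformer="HTMLStripTransformer"
        query="SELECT br.other_content FROM block_reply br">
  <!-- column = name in the result set; name = target field in the Solr index. -->
  <field column="other_content" name="replyContent" stripHTML="true" />
</entity>
```

Either way, the transformer attributes have to sit on a field whose column matches what the query actually returns.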

I don't think any of this helps you identify my problem, but I tried to
address your questions.

Thanks,
Andy




Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Gora Mohanty
On 2 July 2013 20:55, Andy Pickler andy.pick...@gmail.com wrote:
> Thanks for the quick reply.  Unfortunately, I don't believe my company
> would want me sharing our exact production schema in a public forum,
> although I realize it makes it harder to diagnose the problem.  The
> sub-entity is a multi-valued field that indeed does have a relationship to
> the outer entity.  I just left off the 'where' clause from the sub-entity,
> as I didn't believe it was helpful in the context of this problem.  We use
> the convention of...
>
> SELECT dbColumnName AS solrFieldName
>
> ...so that we can relate the database column name to what we want it to be
> named in the Solr index.
>
> I don't think any of this helps you identify my problem, but I tried to
> address your questions.

Um, with all due respect, I do not then know how to
address your issues in a public forum.

Maybe you are then better off hiring someone to handle
your specific problems, after signing an NDA or whatever
it takes from your side: Please see http://wiki.apache.org/solr/Support

Regards,
Gora


HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Hi,

I am using DIH to index some database fields. These fields contain html
formatted text in them. I use the 'HTMLStripTransformer' to remove that
markup. This works fine when the text is like for example:

<li>Item One</li> or <b>This is in Bold</b>

However when the text has HTML entity names like in:

&lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;

NOTHING HAPPENS. 

Two questions.

(1) Is this the expected behavior of DIH HTMLStripTransformer?
(2) If yes, is there another transformer that I can employ first to turn
these HTML entities into their usual symbols that can then be removed by the
DIH HTMLStripTransformer?

Thanks

- ashok



--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Gora Mohanty
On 4 April 2013 00:30, Ashok ash...@qualcomm.com wrote:
[...]
> Two questions.
>
> (1) Is this the expected behavior of DIH HTMLStripTransformer?

Yes, I believe so.

> (2) If yes, is there another transformer that I can employ first to turn
> these HTML entities into their usual symbols that can then be removed by the
> DIH HTMLStripTransformer?

How are the HTML tags getting converted into entities?
Are you escaping input somewhere?

Regards,
Gora


Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Well, the database field has text,  sometimes with HTML entities and at other
times with html tags. I have no control over the process that populates the
database tables with info.





Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Alexandre Rafalovitch
Then, I would say, you have a bigger problem.

However, you can probably run a RegEx filter and replace those known escapes
with real characters before you run your HTMLStrip filter. Or run
HTMLStrip, RegEx and HTMLStrip again.
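
A rough sketch of the first option, using DIH's RegexTransformer ahead of HTMLStripTransformer. The entity, query and column names (items, body, body_lt, body_clean) are hypothetical, and the chain has not been tested here; it only handles the two escapes shown in the question.

```xml
<!-- Sketch only: entity, query and column names are hypothetical. -->
<entity name="items" dataSource="database"
        transformer="RegexTransformer,HTMLStripTransformer"
        query="SELECT id, body FROM items">
  <!-- Pass 1: turn the escaped "&lt;" back into "<" in an intermediate column. -->
  <field column="body_lt" sourceColName="body"
         regex="&amp;lt;" replaceWith="&lt;" />
  <!-- Pass 2: turn "&gt;" back into ">", then let HTMLStripTransformer strip the result. -->
  <field column="body_clean" sourceColName="body_lt"
         regex="&amp;gt;" replaceWith="&gt;" stripHTML="true" />
</entity>
```

Inside the XML attributes, &amp;lt; is read by the parser as the literal text &lt;, which is what the regex has to match in the database value.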

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)





Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Hi Ashok,

HTMLStripTransformer uses HTMLStripCharFilter under the hood, and 
HTMLStripCharFilter converts all HTML entities to their corresponding 
characters.

What version of Solr are you using?

My guess is that it only appears that nothing is happening, since when they are 
presented in a browser, they show up as the characters the entities represent.

I think (never done this myself) that if you apply the HTMLStripTransformer 
twice, it will first convert the entities to characters, and then on the second 
pass, remove the HTML constructs.

From http://wiki.apache.org/solr/DataImportHandler#Transformer:

-
The entity transformer attribute can consist of a comma separated list of 
transformers (say transformer=foo.X,foo.Y). The transformers are chained in 
this case and they are applied one after the other in the order in which they 
are specified. What this means is that after the fields are fetched from the 
datasource, the list of entity columns are processed one at a time in the order 
listed inside the entity tag and scanned by the first transformer to see if any 
of that transformers attributes are present. If so the transformer does it's 
thing! When all of the listed entity columns have been scanned the process is 
repeated using the next transformer in the list.
-
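
Under the chaining rule quoted above, applying the transformer twice could look like the sketch below: the same transformer is listed twice, and the single stripHTML attribute is picked up on each pass. The entity, query and column names are hypothetical, and the thread does not show the configuration that was finally used.

```xml
<!-- Sketch only: entity, query and column names are hypothetical. -->
<entity name="items" dataSource="database"
        transformer="HTMLStripTransformer,HTMLStripTransformer"
        query="SELECT id, body FROM items">
  <!-- The same stripHTML attribute is seen by both passes:
       pass 1 decodes entities such as &lt;b&gt; into <b>,
       pass 2 then removes the resulting tags. -->
  <field column="body" stripHTML="true" />
</entity>
```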

Steve





Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Hi Steve,

Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice
did the trick. I am using Solr 4.1.

Thank you very much!

- ashok





Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Cool, glad I was able to help.
