Re: Indexing of HTML Column in an MS SQL Server 2014 database
Hi Jörg,

A bit off topic: I wonder if you are indexing blobs as base64-encoded fields in the JDBC river? (I did not look at the doc.)

--
David ;-)
Twitter: @dadoonet / @elasticsearchfr / @scrutmydocs

On 22 Feb 2015, at 18:11, joergpra...@gmail.com wrote:

Can you give some information about the mapper attachment setup you used successfully? There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri@googlemail.com wrote:

I need to index an HTML column (nvarchar(MAX)) in an MS SQL Server database. I have set up a JDBC river (https://github.com/jprante/elasticsearch-river-jdbc) and the database is indexed.

Using

    "settings": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "standard", "lowercase" ],
            "char_filter": [ "html_strip" ]
          }
        }
      }
    }

is good for searching but not for the highlighter, as that sometimes returns trimmed, unpaired HTML tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original HTML tags are gone - but I am unable to get the river to push the column directly to the Mapper Attachments.

Questions:

1. What is the best practice for indexing HTML columns? I am aware of the possibility of manually removing HTML tags using Agility Pack, but I do not like that as it's too much extra maintenance.
2. Is there any better highlighter for HTML data which doesn't cut off any original HTML tags?
3. How do I plug the JDBC river into Mapper Attachments?
4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
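On question 1 of the original post, one option (not proposed by anyone in the thread, just a sketch) is to strip the tags before the text reaches Elasticsearch and index the plain text in its own field, so the highlighter never sees markup. The example below uses only Python's standard library; the field names `ContentHTML` and `ContentText` are hypothetical, and the extractor is deliberately naive (it keeps script/style text and inserts a space at every tag boundary).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join with spaces so words in adjacent block elements do not merge,
        # then collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_index_document(row_id, html):
    """Build a document with the raw HTML and a tag-free copy.

    Hypothetical field names: ContentHTML is kept for display,
    ContentText is the field to search and highlight on.
    """
    return {"ID": row_id, "ContentHTML": html, "ContentText": strip_html(html)}


if __name__ == "__main__":
    print(to_index_document(1, "<p>Hello <b>world</b> &amp; more</p>"))
```

Highlighting on the tag-free field means fragments cannot contain half-open tags, at the cost of storing the text twice.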
Re: Indexing of HTML Column in an MS SQL Server 2014 database
For java.sql.Types.BLOB, I use the builder.value(Object object) method in XContentBuilder, with a byte array as parameter.

For java.sql.Types.CLOB/NCLOB, I use just a string as returned by JDBC in Clob.getSubString.

There are DBs which store blobs as java.sql.Types.BINARY, and this can be passed as string or byte array to XContentBuilder (default is byte array).

Here, it is an NVARCHAR column of MS SQL, which is always returned by JDBC as a string by the getNString() method.

Jörg
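To illustrate the byte-array versus string distinction Jörg describes, here is a small Python sketch. It is not the river's Java code; it only mimics the idea, assuming that binary values end up base64-encoded in the JSON document while strings are passed through unchanged. The function names and the sample row are made up for the example.

```python
import base64
import json


def to_json_value(value):
    """Map a column value to something JSON can carry.

    Binary values (the BLOB/BINARY case) become base64 text;
    strings (the CLOB/NVARCHAR case) and numbers are left as they are.
    """
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return value


def build_document(row):
    """Serialize one result row (column name -> value) to a JSON string."""
    converted = {name: to_json_value(value) for name, value in row.items()}
    return json.dumps(converted, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical row: Content mimics an NVARCHAR column, Raw a BLOB column.
    row = {"ID": 1, "Content": "<p>Jörg</p>", "Raw": b"<p>hi</p>"}
    print(build_document(row))
```

Under that assumption an NVARCHAR column reaches the index as plain text, so a field that expects base64 input would need the conversion to happen somewhere else, for example in the SQL statement.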
Re: Indexing of HTML Column in an MS SQL Server 2014 database
Can you give some information about the mapper attachment setup you used successfully? There is no good reason why this should not be possible with JDBC river.

Jörg
RE: Indexing of HTML Column in an MS SQL Server 2014 database
Thank you very much for your kind answer.

If I encode the HTML file into Base64 and use the enclosed script, then all works just fine.

So, Joerg:

1. Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself?
2. If not, do you recommend varbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work? What are your thoughts?

   BTW, I have been able to convert the nvarchar to Base64 using this query:

       SELECT ID,
              CAST(N'' AS xml).value('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') AS Content
       FROM (SELECT ID,
                    CAST(CAST(Content AS varchar(MAX)) AS varbinary(MAX)) AS Content
             FROM KBArticles) k;

   The usual river and mapper attachment work just fine, but the initial indexing takes substantially longer. Why?
3. Are there any performance settings I could tweak?

Enclosed script (truncated in the archive):

    DELETE /test_e/

    PUT /test_e/
    { }

    PUT /test_e/kbarticles/_mapping
    {
      "kbarticles": {
        "properties": {
          "ID": { "type": "integer", "store": "yes", "index": "analyzed" },
          "TitleHTML": { "type": "string", "store": "yes", "index": "analyzed"
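One thing worth checking in the Base64 query from this message is the inner cast from nvarchar to varchar: it narrows the text to a single-byte code page before the bytes are encoded. The Python sketch below roughly mimics that step so the effect can be seen outside the database. The code page `cp1252` is an assumption (the real one depends on the column collation), and Python's `errors="replace"` only approximates how SQL Server substitutes characters it cannot map.

```python
import base64


def base64_like_query(html, codepage="cp1252"):
    """Mimic nvarchar -> varchar -> varbinary -> Base64.

    Characters outside the assumed code page are replaced by "?",
    which is roughly what the narrowing cast does.
    """
    narrowed = html.encode(codepage, errors="replace")
    return base64.b64encode(narrowed).decode("ascii")


def base64_utf8(html):
    """Alternative: encode the full Unicode text as UTF-8 before Base64."""
    return base64.b64encode(html.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    sample = "<p>Jiří</p>"
    # The narrowed variant loses "ř"; the UTF-8 variant keeps it.
    print(base64.b64decode(base64_like_query(sample)))
    print(base64.b64decode(base64_utf8(sample)).decode("utf-8"))
```

If the articles contain only characters from the column's code page, the two variants carry the same text. Either way, Base64 output is about one third larger than the bytes it encodes, which may be one contributor to the longer initial indexing, alongside the extra parsing done by the attachment mapper.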
RE: Indexing of HTML Column in an MS SQL Server 2014 database
And David: Would it be possible to index text/html given as text rather than Base64?
RE: Indexing of HTML Column in an MS SQL Server 2014 database
David: Do I need to use copy_to with a new dummy column in order for the highlighting to work?