RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
The table has 3000 rows. The index status is shown below:

{
  "index": {
    "primary_size_in_bytes": 296341451,
    "size_in_bytes": 296341451
  },
  "translog": {
    "operations": 0
  },
  "docs": {
    "num_docs": 3000,
    "max_doc": 3000,
    "deleted_docs": 0
  }
}

I believe it is the mapper attachment that is causing this delay.

 

David – is there any way to speed this up?
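
As a rough sanity check on the figures in this message, the average index size per document can be worked out directly. This is only a sketch: the two constants are copied from the index status reported here, and the variable names are illustrative.

```python
# Figures copied from the index status reported in this message.
size_in_bytes = 296341451
num_docs = 3000

# Average on-disk index size per document.
avg_bytes_per_doc = size_in_bytes / num_docs
avg_kib_per_doc = avg_bytes_per_doc / 1024

print(f"average size per document: {avg_bytes_per_doc:.0f} bytes "
      f"(~{avg_kib_per_doc:.1f} KiB)")
```

An average in this range is large for a single text column, which may suggest that the stored attachment content accounts for most of the index size. That is an inference, not something the statistics confirm.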

 

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 10:15 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

 

How big is the entire table you index?

 

You can use monitoring tools like BigDesk to verify the resources ES is using.

 

It is close to impossible that Base64 encoding alone takes 20x longer while 
indexing; maybe the mapper attachment is doing other extra work.

 

Jörg

 


RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
David, can you advise whether anything can be done to speed up indexing? Are 
there any config parameters I could use to tune the performance?



Joerg, the indexing behaves completely differently. If I do not use the 
mapper attachment, the size of the index and the number of indexed documents 
grow at the same time. With mapper attachments, however, the size of the index 
grows while the number of indexed documents stays at 0 until the entire index 
has been built.





From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 5:13 PM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



I am not sure, but it looks like the mapper attachment is doing some extra 
processing, for example Tika, which is very expensive. Maybe there is some 
configuration option; I did not check.



Jörg




RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
Thank you for opening the issue.



If I index the column as varchar and use the default ES indexing, the entire 
table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up 
to 2 minutes. I am not sure whether that is because of the extra work SQL Server 
is doing or the extra data volume the JDBC river has to handle, but I assume it 
may be because of the way the Mapper Attachments work.
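
To put the two timings side by side, here is a small back-of-the-envelope calculation. It is a sketch only: the row count of 3000 is the figure reported elsewhere in this thread, "up to 2 minutes" is taken as 120 seconds, and the variable names are illustrative.

```python
num_docs = 3000            # rows in the table, as reported in this thread
plain_seconds = 5          # default ES indexing of the varchar column
attachment_seconds = 120   # "up to 2 minutes" with Mapper Attachments

# Indexing throughput in documents per second for each setup.
plain_rate = num_docs / plain_seconds
attachment_rate = num_docs / attachment_seconds

# How many times slower the attachment-based run is.
slowdown = attachment_seconds / plain_seconds

print(f"plain: {plain_rate:.0f} docs/s, "
      f"attachments: {attachment_rate:.0f} docs/s, "
      f"slowdown: {slowdown:.0f}x")
```

These are upper-bound figures for the slow case, since the 2 minutes is a worst-case observation rather than a measured average.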







From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



1. I opened an issue for adding optional base64 encoding on columns: 
https://github.com/jprante/elasticsearch-river-jdbc/issues/472



2. What is initial indexing? What do you mean by slower?



3. Yes, you can change the documented bulk index settings.



Jörg






Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik


I need to index an HTML column (nvarchar(MAX)) in an MS SQL Server database. 
I have set up a JDBC river (https://github.com/jprante/elasticsearch-river-jdbc) 
and the database is indexed.

Using 

  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"],
          "char_filter": ["html_strip"]
        }
      }
    }
  }

is good for searching but not for the highlighter, which sometimes returns 
truncated, unpaired HTML tags. 

I have experimented with the Mapper Attachments using HTML attachments, and 
then the highlighter works well (all original HTML tags are gone), but I am 
unable to get the river to push the column directly to the Mapper Attachments.

Questions:

1. What is the best practice for indexing HTML columns? I am aware of the 
possibility of removing HTML tags manually using Agility Pack, but I do not 
like that as it is too much extra maintenance.

2. Is there a better highlighter for HTML data that doesn't cut off any 
original HTML tags?

3. How do I plug the JDBC river into Mapper Attachments?

4. Any better ideas for how to achieve my goals?


Thanks!
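
For reference, the manual tag removal mentioned in question 1 does not have to be heavy. The following is a minimal sketch using only Python's standard library rather than Agility Pack. It assumes the stripping runs before the row reaches Elasticsearch, so that the field is indexed and highlighted as plain text. `TextExtractor` and `html_to_text` are hypothetical names, not part of the river or of any plugin.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse whitespace runs.
        # Caveat: a word split by inline tags ("pass<b>word</b>")
        # comes out as two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Reset your <b>password</b> &amp; log in.</p>"))
```

The trade-off is the one already raised in the question: this is extra code to maintain outside Elasticsearch, in exchange for highlights that never contain partial tags.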

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


RE: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik
Thank you very much for your kind answer. If I encode the HTML file into 
Base64 and use the enclosed script, then everything works fine. 

 

So, Joerg:

 

1.   Is there a way for the JDBC river to transform the nvarchar(MAX) column 
into Base64 by itself? 

2.   If not, do you recommend varbinary(MAX) or some other MS SQL Server 
type? Would SELECT * FROM XXX then just work?

 

What are your thoughts?

 

BTW, I have been able to convert the nvarchar to Base64 using this query:

select ID,
       cast(N'' as xml).value(
           'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
           'varchar(max)') as Content
from (select ID,
             cast(cast(Content as varchar(max)) as varbinary(max)) as Content
      from KBArticles) k;
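
The conversion this query performs can be mirrored outside the database, which may help when reasoning about its cost and side effects. The sketch below is an illustration under stated assumptions, not what the river or SQL Server does internally: `cp1252` is an assumed code page (the real one depends on the database collation), and the sample string is made up.

```python
import base64

# Sample value standing in for one row of the Content column (made up).
content = "<p>Caf\u00e9 \u4e2d</p>"

# Step 1: nvarchar -> varchar -> varbinary. cp1252 is an ASSUMED code page.
# Characters it cannot represent are replaced with "?", so the cast to
# varchar can lose data.
single_byte = content.encode("cp1252", errors="replace")

# Step 2: varbinary -> Base64 text, the job of xs:base64Binary in the query.
encoded = base64.b64encode(single_byte).decode("ascii")

print(single_byte)
print(len(single_byte), "bytes ->", len(encoded), "Base64 characters")
```

Base64 output is about 4/3 the size of its input, so the encoded column carries roughly a third more data per row. The cast to varchar is the lossy step; Base64 itself round-trips exactly.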

 

 

The usual river and mapper attachment work fine with this, but the initial 
indexing takes substantially longer. Why?

 

3.   Are there any performance settings I could tweak?

 

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

 

Can you give some information about the mapper attachment setup you used 
successfully?

 

There is no good reason why this should not be possible with JDBC river.

 

Jörg

 

DELETE /test_e/

PUT /test_e/
{
}

PUT /test_e/kbarticles/_mapping
{
  "kbarticles": {
    "properties": {
      "ID": {
        "type": "integer",
        "store": "yes",
        "index": "analyzed"
      },
      "TitleHTML": {
        "type": "string",
        "store": "yes",
        "index": "analyzed"

FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik
Apologies to everyone for sending these emails with a digital signature, which 
may have caused some issues.



Summary for Joerg:



1.   Is there a way for the JDBC river to transform the nvarchar(MAX) column 
into Base64 by itself? I can do it on SQL Server (see item 1 for David below), 
but it is substantially slower.

2.   If not, do you recommend varbinary(MAX) or some other MS SQL Server 
type? Would SELECT * FROM XXX then just work?



Summary for David:

1.   If I convert the HTML column using

select ID,
       cast(N'' as xml).value(
           'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
           'varchar(max)') as Content
from (select ID,
             cast(cast(Content as varchar(max)) as varbinary(max)) as Content
      from KBArticles) k;

the indexing works but takes longer than usual. Is there any performance 
setting I could use?

2.   Would it be possible for the attachment mapper to index a plain text file 
without Base64?














RE: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik
And David:

 

Would it be possible to index text/html given as text rather than Base64?

 

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of David Pilato
Sent: Sunday, February 22, 2015 6:15 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

 

Hi Jörg 

 

A bit off topic: I wonder whether you are indexing blobs as Base64-encoded 
fields in the JDBC river?

(I did not look at the doc)

--

David ;-)

Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs




RE: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik
David: Do I need to use copy_to to copy the content into a new dummy field in 
order for the highlighting to work?

 
