RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
David – can you advise if there is anything which can be done to speed up 
indexing? Are there any config parameters I could use to tweak the performance?



Joerg – the entire indexing looks completely differently. If I do not use the 
mapper attachment, the size of the index and the number of indexed documents 
grow at the same time. With mapper attachments, however, the size of the index 
grows with the number of indexed documents staying at 0 until the entire index 
is being built.





From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 5:13 PM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



I am not sure but it looks like mapper attachment is doing some extra 
processing, for example Tika, which is very expensive. Maybe there is some 
configuration option, I did not check.



Jörg



On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik 
mailto:jiri@jiripik.com>> wrote:

   The table has 3000 rows, the index is defined as below



   {

   {

   "index": {

   "primary_size_in_bytes": 296341451,

   "size_in_bytes": 296341451

   },

   "translog": {

   "operations": 0

   },

   "docs": {

   "num_docs": 3000,

   "max_doc": 3000,

   "deleted_docs": 0

   },},





   I believe it’s the mapper attachment who is causing this delay.



   David – is there any way to speed this up?



   From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> 
[mailto:elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>] 
On Behalf Of joergpra...@gmail.com<mailto:joergpra...@gmail.com>
   Sent: Monday, February 23, 2015 10:15 AM


   To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
   Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



   How big is the entire table you index?



   You can use monitor tools like BigDesk to verify the resources ES is using.



   It is close to impossible that just base64 encoding takes 20x longer while 
indexing, maybe mapper attachment is doing other extra work.



   Jörg



   On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik 
mailto:jiri@jiripik.com>> wrote:

  Thank you for opening of the issue.



  If I indexed the column as varchar and used the default ES indexing, the 
entire table is indexed within 5 seconds. If I use the Mapper Attachments, it 
takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL 
Server is doing, or the extra volume the jdbc is taking care, but I assume it 
may be because of the way the Mapper Attachments works?







  From: 
elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> 
[mailto:elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>] 
On Behalf Of joergpra...@gmail.com<mailto:joergpra...@gmail.com>
      Sent: Monday, February 23, 2015 9:26 AM
      To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
  Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



  1. I opened an issue for adding optional base64 encoding on columns: 
https://github.com/jprante/elasticsearch-river-jdbc/issues/472



  2. What is "initial indexing"? What do you mean by "slower"?



  3. Yes, you can change the documented bulk index settings.



  Jörg





  On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik 
mailto:jiri@jiripik.com>> wrote:

 Apologies for everyone for sending these emails with digital signature 
which may have caused some issues:



 Summary for Joerg:



 1.   Is there a way for the JDBC river to transform the 
nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for 
David – but it’s substantially slower

 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL 
Server type? And then the SELECT * from XXX would just work?



 Summary for David:

 1.   If I convert the HTML column using select ID, cast(N'' as 
xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 
'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX 
)) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but 
takes longer than usual – is there any performance setting I could use?

 2.   Would it be possible for the attachment mapper to index pure 
txt file without base64?













 From: Jiri Pik
 Sent: Monday, February 23, 2015 6:08 AM
 To: 
elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
 Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database



 Thank

Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread joergpra...@gmail.com
I am not sure but it looks like mapper attachment is doing some extra
processing, for example Tika, which is very expensive. Maybe there is some
configuration option, I did not check.

Jörg

On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik  wrote:

> The table has 3000 rows, the index is defined as below
>
>
>
> {
>
> {
>
> "index": {
>
> "primary_size_in_bytes": 296341451,
>
> "size_in_bytes": 296341451
>
> },
>
> "translog": {
>
> "operations": 0
>
> },
>
> "docs": {
>
> "num_docs": 3000,
>
> "max_doc": 3000,
>
> "deleted_docs": 0
>
> },},
>
>
>
>
>
> I believe it’s the mapper attachment who is causing this delay.
>
>
>
> David – is there any way to speed this up?
>
>
>
> *From:* elasticsearch@googlegroups.com [mailto:
> elasticsearch@googlegroups.com] *On Behalf Of *joergpra...@gmail.com
> *Sent:* Monday, February 23, 2015 10:15 AM
>
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: FW: Indexing of HTML Column in an MS SQL Server 2014
> database
>
>
>
> How big is the entire table you index?
>
>
>
> You can use monitor tools like BigDesk to verify the resources ES is using.
>
>
>
> It is close to impossible that just base64 encoding takes 20x longer while
> indexing, maybe mapper attachment is doing other extra work.
>
>
>
> Jörg
>
>
>
> On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik  wrote:
>
> Thank you for opening of the issue.
>
>
>
> If I indexed the column as varchar and used the default ES indexing, the
> entire table is indexed within 5 seconds. If I use the Mapper Attachments,
> it takes up to 2 minutes. I am not sure whether it’s because of the extra
> work SQL Server is doing, or the extra volume the jdbc is taking care, but
> I assume it may be because of the way the Mapper Attachments works?
>
>
>
>
>
>
>
> *From:* elasticsearch@googlegroups.com [mailto:
> elasticsearch@googlegroups.com] *On Behalf Of *joergpra...@gmail.com
> *Sent:* Monday, February 23, 2015 9:26 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: FW: Indexing of HTML Column in an MS SQL Server 2014
> database
>
>
>
> 1. I opened an issue for adding optional base64 encoding on columns:
> https://github.com/jprante/elasticsearch-river-jdbc/issues/472
>
>
>
> 2. What is "initial indexing"? What do you mean by "slower"?
>
>
>
> 3. Yes, you can change the documented bulk index settings.
>
>
>
> Jörg
>
>
>
>
>
> On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik  wrote:
>
> Apologies for everyone for sending these emails with digital signature
> which may have caused some issues:
>
>
>
> Summary for Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself? I can do on SQL server – see below (1) for David –
> but it’s substantially slower
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> Summary for David:
>
> 1.   If I convert the HTML column using select ID, cast(N'' as xml).
> value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
> 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as
> varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
> indexing just works but takes longer than usual – is there any performance
> setting I could use?
>
> 2.   Would it be possible for the attachment mapper to index pure txt
> file without base64?
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Jiri Pik
> *Sent:* Monday, February 23, 2015 6:08 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* RE: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Thank you very much for your kind answer. If I encode the html file into
> Base64, and use the enclosed script, then all works just fine.
>
>
>
> So, Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself?
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> What are your thoughts?
>
>
>
> BTW I have been able to convert the nvarchar to base64 using this query
>
> select ID, cast(N'' as xml).value (
> 'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
> as Content fro

RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
The table has 3000 rows, the index is defined as below

 

{

{

"index": {

"primary_size_in_bytes": 296341451,

"size_in_bytes": 296341451

},

"translog": {

"operations": 0

},

"docs": {

"num_docs": 3000,

"max_doc": 3000,

"deleted_docs": 0

},},

 

 

I believe it’s the mapper attachment who is causing this delay. 

 

David – is there any way to speed this up?

 

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 10:15 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

 

How big is the entire table you index?

 

You can use monitor tools like BigDesk to verify the resources ES is using.

 

It is close to impossible that just base64 encoding takes 20x longer while 
indexing, maybe mapper attachment is doing other extra work.

 

Jörg

 

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik mailto:jiri@jiripik.com> > wrote:

Thank you for opening of the issue.

 

If I indexed the column as varchar and used the default ES indexing, the entire 
table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up 
to 2 minutes. I am not sure whether it’s because of the extra work SQL Server 
is doing, or the extra volume the jdbc is taking care, but I assume it may be 
because of the way the Mapper Attachments works?

 

 

 

From: elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com>  
[mailto:elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com> 
] On Behalf Of joergpra...@gmail.com <mailto:joergpra...@gmail.com> 
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com> 
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

 

1. I opened an issue for adding optional base64 encoding on columns: 
https://github.com/jprante/elasticsearch-river-jdbc/issues/472

 

2. What is "initial indexing"? What do you mean by "slower"?

 

3. Yes, you can change the documented bulk index settings.

 

Jörg

 

 

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik mailto:jiri@jiripik.com> > wrote:

Apologies for everyone for sending these emails with digital signature which 
may have caused some issues:

 

Summary for Joerg:

 

1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into 
Base64 by itself? I can do on SQL server – see below (1) for David – but it’s 
substantially slower

2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server 
type? And then the SELECT * from XXX would just work?

 

Summary for David:

1.   If I convert the HTML column using select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k; the indexing just works but takes longer than 
usual – is there any performance setting I could use?

2.   Would it be possible for the attachment mapper to index pure txt file 
without base64?

 

 

 

 

 

 

From: Jiri Pik 
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com> 
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

 

Thank you very much for your kind answer. If I encode the html file into 
Base64, and use the enclosed script, then all works just fine. 

 

So, Joerg:

 

1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into 
Base64 by itself? 

2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server 
type? And then the SELECT * from XXX would just work?

 

What are your thoughts?

 

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k;

 

 

The usual river and mapper attachment work just fine but the initial indexing 
takes substantially longer. Why?

 

3.   Is there any performance settings I could tweak?

 

From: elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com>  
[mailto:elasticsearch@googlegroups.com] On Behalf Of joergpra...@gmail.com 
<mailto:joergpra...@gmail.com> 
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com <mailto:elasticsearch@googlegroups.com> 
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

 

Can you give some information about the mapper attachment setup you used 
successfully?

 

Th

Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread joergpra...@gmail.com
How big is the entire table you index?

You can use monitor tools like BigDesk to verify the resources ES is using.

It is close to impossible that just base64 encoding takes 20x longer while
indexing, maybe mapper attachment is doing other extra work.

Jörg

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik  wrote:

>  Thank you for opening of the issue.
>
>
>
> If I indexed the column as varchar and used the default ES indexing, the
> entire table is indexed within 5 seconds. If I use the Mapper Attachments,
> it takes up to 2 minutes. I am not sure whether it’s because of the extra
> work SQL Server is doing, or the extra volume the jdbc is taking care, but
> I assume it may be because of the way the Mapper Attachments works?
>
>
>
>
>
>
>
> *From:* elasticsearch@googlegroups.com [mailto:
> elasticsearch@googlegroups.com] *On Behalf Of *joergpra...@gmail.com
> *Sent:* Monday, February 23, 2015 9:26 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: FW: Indexing of HTML Column in an MS SQL Server 2014
> database
>
>
>
> 1. I opened an issue for adding optional base64 encoding on columns:
> https://github.com/jprante/elasticsearch-river-jdbc/issues/472
>
>
>
> 2. What is "initial indexing"? What do you mean by "slower"?
>
>
>
> 3. Yes, you can change the documented bulk index settings.
>
>
>
> Jörg
>
>
>
>
>
> On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik  wrote:
>
>  Apologies for everyone for sending these emails with digital signature
> which may have caused some issues:
>
>
>
> Summary for Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself? I can do on SQL server – see below (1) for David –
> but it’s substantially slower
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> Summary for David:
>
> 1.   If I convert the HTML column using select ID, cast(N'' as xml).
> value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
> 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as
> varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
> indexing just works but takes longer than usual – is there any performance
> setting I could use?
>
> 2.   Would it be possible for the attachment mapper to index pure txt
> file without base64?
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Jiri Pik
> *Sent:* Monday, February 23, 2015 6:08 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* RE: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Thank you very much for your kind answer. If I encode the html file into
> Base64, and use the enclosed script, then all works just fine.
>
>
>
> So, Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself?
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> What are your thoughts?
>
>
>
> BTW I have been able to convert the nvarchar to base64 using this query
>
> select ID, cast(N'' as xml).value (
> 'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
> as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as
> varbinary( MAX)) Content from KBArticles) k;
>
>
>
>
>
> The usual river and mapper attachment work just fine but the initial
> indexing takes substantially longer. Why?
>
>
>
> 3.   Is there any performance settings I could tweak?
>
>
>
> *From:* elasticsearch@googlegroups.com [
> mailto:elasticsearch@googlegroups.com ] *On
> Behalf Of *joergpra...@gmail.com
> *Sent:* Sunday, February 22, 2015 6:12 PM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Can you give some information about the mapper attachment setup you used
> successfully?
>
>
>
> There is no good reason why this should not be possible with JDBC river.
>
>
>
> Jörg
>
>
>
> On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik  wrote:
>
>  I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server
> database. I have set up a JDBC river
> https://github.com/jprante/elasticsearch-river-jdbc and the database is
> indexed.
>
> Using
>
>   "settings":{
>
> "analysis":{
>
>   "analyzer":{
>
> "default&q

RE: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread Jiri Pik
Thank you for opening of the issue.



If I indexed the column as varchar and used the default ES indexing, the entire 
table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up 
to 2 minutes. I am not sure whether it’s because of the extra work SQL Server 
is doing, or the extra volume the jdbc is taking care, but I assume it may be 
because of the way the Mapper Attachments works?







From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On 
Behalf Of joergpra...@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



1. I opened an issue for adding optional base64 encoding on columns: 
https://github.com/jprante/elasticsearch-river-jdbc/issues/472



2. What is "initial indexing"? What do you mean by "slower"?



3. Yes, you can change the documented bulk index settings.



Jörg





On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik 
mailto:jiri@jiripik.com>> wrote:

   Apologies for everyone for sending these emails with digital signature which 
may have caused some issues:



   Summary for Joerg:



   1.   Is there a way for the JDBC river to transform the nvarchar(MAX) 
into Base64 by itself? I can do on SQL server – see below (1) for David – but 
it’s substantially slower

   2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL 
Server type? And then the SELECT * from XXX would just work?



   Summary for David:

   1.   If I convert the HTML column using select ID, cast(N'' as 
xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 
'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX 
)) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but 
takes longer than usual – is there any performance setting I could use?

   2.   Would it be possible for the attachment mapper to index pure txt 
file without base64?













   From: Jiri Pik
   Sent: Monday, February 23, 2015 6:08 AM
   To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
   Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database



   Thank you very much for your kind answer. If I encode the html file into 
Base64, and use the enclosed script, then all works just fine.



   So, Joerg:



   1.   Is there a way for the JDBC river to transform the nvarchar(MAX) 
into Base64 by itself?

   2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL 
Server type? And then the SELECT * from XXX would just work?



   What are your thoughts?



   BTW I have been able to convert the nvarchar to base64 using this query

   select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k;





   The usual river and mapper attachment work just fine but the initial 
indexing takes substantially longer. Why?



   3.   Is there any performance settings I could tweak?



   From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> 
[mailto:elasticsearch@googlegroups.com] On Behalf Of 
joergpra...@gmail.com<mailto:joergpra...@gmail.com>
   Sent: Sunday, February 22, 2015 6:12 PM
   To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
   Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database



   Can you give some information about the mapper attachment setup you used 
successfully?



   There is no good reason why this should not be possible with JDBC river.



   Jörg



   On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik 
mailto:jiri@googlemail.com>> wrote:

  I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server 
database. I have set up a JDBC river 
https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

  Using

"settings":{

  "analysis":{

"analyzer":{

  "default":{

"type":"custom",

"tokenizer":"standard",

"filter":[ "standard", "lowercase" ],

"char_filter" : ["html_strip"]

  }

}

  }

}

  is good for searching but not for the highlighter as that returns 
sometimes trimmed unpaired html tags.

  I have played with the Mapper Attachments with HTML attachments and then 
the highlighter works well - all original html tags are gone - but I am unable 
to get the river push the column directly to the Mapper Attachments.

  Questions:

  1. what is the best practic

Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-23 Thread joergpra...@gmail.com
1. I opened an issue for adding optional base64 encoding on columns:
https://github.com/jprante/elasticsearch-river-jdbc/issues/472

2. What is "initial indexing"? What do you mean by "slower"?

3. Yes, you can change the documented bulk index settings.

Jörg


On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik  wrote:

>  Apologies for everyone for sending these emails with digital signature
> which may have caused some issues:
>
>
>
> Summary for Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself? I can do on SQL server – see below (1) for David –
> but it’s substantially slower
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> Summary for David:
>
> 1.   If I convert the HTML column using select ID, cast(N'' as xml).
> value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
> 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as
> varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
> indexing just works but takes longer than usual – is there any performance
> setting I could use?
>
>  2.   Would it be possible for the attachment mapper to index pure
> txt file without base64?
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Jiri Pik
> *Sent:* Monday, February 23, 2015 6:08 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* RE: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Thank you very much for your kind answer. If I encode the html file into
> Base64, and use the enclosed script, then all works just fine.
>
>
>
> So, Joerg:
>
>
>
> 1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself?
>
> 2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> What are your thoughts?
>
>
>
> BTW I have been able to convert the nvarchar to base64 using this query
>
> select ID, cast(N'' as xml).value (
> 'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
> as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as
> varbinary( MAX)) Content from KBArticles) k;
>
>
>
>
>
> The usual river and mapper attachment work just fine but the initial
> indexing takes substantially longer. Why?
>
>
>
> 3.   Is there any performance settings I could tweak?
>
>
>
> *From:* elasticsearch@googlegroups.com [
> mailto:elasticsearch@googlegroups.com ] *On
> Behalf Of *joergpra...@gmail.com
> *Sent:* Sunday, February 22, 2015 6:12 PM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Can you give some information about the mapper attachment setup you used
> successfully?
>
>
>
> There is no good reason why this should not be possible with JDBC river.
>
>
>
> Jörg
>
>
>
> On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik  wrote:
>
>  I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server
> database. I have set up a JDBC river
> https://github.com/jprante/elasticsearch-river-jdbc and the database is
> indexed.
>
> Using
>
>   "settings":{
>
> "analysis":{
>
>   "analyzer":{
>
> "default":{
>
>   "type":"custom",
>
>   "tokenizer":"standard",
>
>   "filter":[ "standard", "lowercase" ],
>
>   "char_filter" : ["html_strip"]
>
> }
>
>   }
>
> }
>
>   }
>
> is good for searching but not for the highlighter as that returns
> sometimes trimmed unpaired html tags.
>
> I have played with the Mapper Attachments with HTML attachments and then
> the highlighter works well - all original html tags are gone - but I am
> unable to get the river push the column directly to the Mapper Attachments.
>
> Questions:
>
> 1. what is the best practice for indexing HTML columns? I am aware of the
> possibility of a manual removal of HTML tags using Agility Pack but do not
> like that as it's too much extra maintenance.
>
> 2. is there any better highlighter for html data which doesn't cut off any
> original html tags?
>
> 3. How to plug in the JDBC river to Mapper Attachments?
>
> 4. Any better ideas how to achieve my goals?
>
>
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e5%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this 

FW: Indexing of HTML Column in an MS SQL Server 2014 database

2015-02-22 Thread Jiri Pik
Apologies for everyone for sending these emails with digital signature which 
may have caused some issues:



Summary for Joerg:



1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into 
Base64 by itself? I can do on SQL server – see below (1) for David – but it’s 
substantially slower

2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server 
type? And then the SELECT * from XXX would just work?



Summary for David:

1.   If I convert the HTML column using select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k; the indexing just works but takes longer than 
usual – is there any performance setting I could use?



2.   Would it be possible for the attachment mapper to index pure txt file 
without base64?













From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database



Thank you very much for your kind answer. If I encode the html file into 
Base64, and use the enclosed script, then all works just fine.



So, Joerg:



1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into 
Base64 by itself?

2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server 
type? And then the SELECT * from XXX would just work?



What are your thoughts?



BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k;





The usual river and mapper attachment work just fine but the initial indexing 
takes substantially longer. Why?



3.   Is there any performance settings I could tweak?



From: elasticsearch@googlegroups.com 
[mailto:elasticsearch@googlegroups.com] On Behalf Of 
joergpra...@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database



Can you give some information about the mapper attachment setup you used 
successfully?



There is no good reason why this should not be possible with JDBC river.



Jörg



On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik 
mailto:jiri@googlemail.com>> wrote:

   I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I 
have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc 
and the database is indexed.

   Using

 "settings":{

   "analysis":{

 "analyzer":{

   "default":{

 "type":"custom",

 "tokenizer":"standard",

 "filter":[ "standard", "lowercase" ],

 "char_filter" : ["html_strip"]

   }

 }

   }

 }

   is good for searching but not for the highlighter as that returns sometimes 
trimmed unpaired html tags.

   I have played with the Mapper Attachments with HTML attachments and then the 
highlighter works well - all original html tags are gone - but I am unable to 
get the river push the column directly to the Mapper Attachments.

   Questions:

   1. what is the best practice for indexing HTML columns? I am aware of the 
possibility of a manual removal of HTML tags using Agility Pack but do not like 
that as it's too much extra maintenance.

   2. is there any better highlighter for html data which doesn't cut off any 
original html tags?

   3. How to plug in the JDBC river to Mapper Attachments?

   4. Any better ideas how to achieve my goals?



   Thanks!

   --
   You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
   To unsubscribe from this group and stop receiving emails from it, send an 
email to 
elasticsearch+unsubscr...@googlegroups.com.
   To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e5%40googlegroups.com.
   For more options, visit https://groups.google.com/d/optout.



   --
   You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
   To unsubscribe from this group and stop receiving emails from it, send an 
email to 
elasticsearch+unsubscr...@googlegroups.com.
   To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%4