I am not sure but it looks like mapper attachment is doing some extra
processing, for example Tika, which is very expensive. Maybe there is some
configuration option, I did not check.

Jörg

On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik <jiri....@jiripik.com> wrote:

> The table has 3000 rows, the index is defined as below
>
>
>
> {
>
> {
>
> "index": {
>
> "primary_size_in_bytes": 296341451,
>
> "size_in_bytes": 296341451
>
> },
>
> "translog": {
>
> "operations": 0
>
> },
>
> "docs": {
>
> "num_docs": 3000,
>
> "max_doc": 3000,
>
> "deleted_docs": 0
>
> },},
>
>
>
>
>
> I believe it’s the mapper attachment who is causing this delay.
>
>
>
> David – is there any way to speed this up?
>
>
>
> *From:* elasticsearch@googlegroups.com [mailto:
> elasticsearch@googlegroups.com] *On Behalf Of *joergpra...@gmail.com
> *Sent:* Monday, February 23, 2015 10:15 AM
>
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: FW: Indexing of HTML Column in an MS SQL Server 2014
> database
>
>
>
> How big is the entire table you index?
>
>
>
> You can use monitor tools like BigDesk to verify the resources ES is using.
>
>
>
> It is close to impossible that just base64 encoding takes 20x longer while
> indexing, maybe mapper attachment is doing other extra work.
>
>
>
> Jörg
>
>
>
> On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik <jiri....@jiripik.com> wrote:
>
> Thank you for opening of the issue.
>
>
>
> If I indexed the column as varchar and used the default ES indexing, the
> entire table is indexed within 5 seconds. If I use the Mapper Attachments,
> it takes up to 2 minutes. I am not sure whether it’s because of the extra
> work SQL Server is doing, or the extra volume the jdbc is taking care, but
> I assume it may be because of the way the Mapper Attachments works?
>
>
>
>
>
>
>
> *From:* elasticsearch@googlegroups.com [mailto:
> elasticsearch@googlegroups.com] *On Behalf Of *joergpra...@gmail.com
> *Sent:* Monday, February 23, 2015 9:26 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: FW: Indexing of HTML Column in an MS SQL Server 2014
> database
>
>
>
> 1. I opened an issue for adding optional base64 encoding on columns:
> https://github.com/jprante/elasticsearch-river-jdbc/issues/472
>
>
>
> 2. What is "initial indexing"? What do you mean by "slower"?
>
>
>
> 3. Yes, you can change the documented bulk index settings.
>
>
>
> Jörg
>
>
>
>
>
> On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik <jiri....@jiripik.com> wrote:
>
> Apologies for everyone for sending these emails with digital signature
> which may have caused some issues:
>
>
>
> Summary for Joerg:
>
>
>
> 1.       Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself? I can do on SQL server – see below (1) for David –
> but it’s substantially slower
>
> 2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> Summary for David:
>
> 1.       If I convert the HTML column using select ID, cast(N'' as xml).
> value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
> 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as
> varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
> indexing just works but takes longer than usual – is there any performance
> setting I could use?
>
> 2.       Would it be possible for the attachment mapper to index pure txt
> file without base64?
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Jiri Pik
> *Sent:* Monday, February 23, 2015 6:08 AM
> *To:* elasticsearch@googlegroups.com
> *Subject:* RE: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Thank you very much for your kind answer. If I encode the html file into
> Base64, and use the enclosed script, then all works just fine.
>
>
>
> So, Joerg:
>
>
>
> 1.       Is there a way for the JDBC river to transform the nvarchar(MAX)
> into Base64 by itself?
>
> 2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL
> Server type? And then the SELECT * from XXX would just work?
>
>
>
> What are your thoughts?
>
>
>
> BTW I have been able to convert the nvarchar to base64 using this query
>
> select ID, cast(N'' as xml).value (
> 'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
> as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as
> varbinary( MAX)) Content from KBArticles) k;
>
>
>
>
>
> The usual river and mapper attachment work just fine but the initial
> indexing takes substantially longer. Why?
>
>
>
> 3.       Is there any performance settings I could tweak?
>
>
>
> *From:* elasticsearch@googlegroups.com [
> mailto:elasticsearch@googlegroups.com <elasticsearch@googlegroups.com>] *On
> Behalf Of *joergpra...@gmail.com
> *Sent:* Sunday, February 22, 2015 6:12 PM
> *To:* elasticsearch@googlegroups.com
> *Subject:* Re: Indexing of HTML Column in an MS SQL Server 2014 database
>
>
>
> Can you give some information about the mapper attachment setup you used
> successfully?
>
>
>
> There is no good reason why this should not be possible with JDBC river.
>
>
>
> Jörg
>
>
>
> On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri....@googlemail.com> wrote:
>
> I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database.
> I have set up a JDBC river
> https://github.com/jprante/elasticsearch-river-jdbc and the database is
> indexed.
>
> Using
>
>   "settings":{
>
>     "analysis":{
>
>       "analyzer":{
>
>         "default":{
>
>           "type":"custom",
>
>           "tokenizer":"standard",
>
>           "filter":[ "standard", "lowercase" ],
>
>           "char_filter" : ["html_strip"]
>
>         }
>
>       }
>
>     }
>
>   }
>
> is good for searching but not for the highlighter as that returns
> sometimes trimmed unpaired html tags.
>
> I have played with the Mapper Attachments with HTML attachments and then
> the highlighter works well - all original html tags are gone - but I am
> unable to get the river push the column directly to the Mapper Attachments.
>
> Questions:
>
> 1. what is the best practice for indexing HTML columns? I am aware of the
> possibility of a manual removal of HTML tags using Agility Pack but do not
> like that as it's too much extra maintenance.
>
> 2. is there any better highlighter for html data which doesn't cut off any
> original html tags?
>
> 3. How to plug in the JDBC river to Mapper Attachments?
>
> 4. Any better ideas how to achieve my goals?
>
>
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
>
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net
> <https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>
> .
>
>
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
>
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net
> <https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>
> .
>
>
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net
> <https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to