Re: [ts] Slow indexing (never finishes) when indexing 4 or more associations

jonsgold Tue, 07 Jul 2015 23:06:06 -0700

Awesome, works like a charm! See the SO question that refers to this 
thread: 
http://stackoverflow.com/questions/30913789/thinking-sphinx-indexing-performance.


On Tuesday, June 30, 2015 at 10:00:13 AM UTC+3, Pat Allan wrote:
>
> You’d have to end up with a fair bit of duplication, but it’s technically 
> possible. 
>
> # creates both core and delta indices 
> ThinkingSphinx::Index.define(:article, 
>   :with  => :active_record, 
>   :delta => ThinkingSphinx::Deltas::ResqueDelta 
> ) do 
>   # … 
> end 
>
> Is the equivalent of: 
>
> # create core index 
> ThinkingSphinx::Index.define(:article, 
>   :with            => :active_record, 
>   :delta?          => false, 
>   :delta_processor => 
> ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta) 
> ) do 
>   # … 
> end 
>
> # create delta index 
> ThinkingSphinx::Index.define(:article, 
>   :with            => :active_record, 
>   :delta?          => true, 
>   :delta_processor => 
> ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta) 
> ) do 
>   # … 
> end 
>
> The first being the core index, the second being the delta, with the same 
> definition block normally being applied to both. If you want to have 
> something slightly different in the delta index definition block, I guess 
> you could try something along these lines? 
>
> — 
> Pat 
>
> > On 30 Jun 2015, at 4:45 pm, [email protected] <javascript:> wrote: 
> > 
> > Thanks, sharding the joined queries works. I'd also like to improve them 
> for the deltas. Is there any way to add "WHERE delta = 1" to the joined 
> queries in the delta definition? 
> > 
> > On Monday, June 29, 2015 at 5:12:27 PM UTC+3, Pat Allan wrote: 
> > I’m not sure why the sizes are so different, but I think the overall 
> issue is related to the three attributes that have :source => :query. 
> > 
> > I’d recommend making two changes to each of them: 
> > 
> > * Add a condition to each query that filters by the appropriate incident 
> ids (like you’re doing for the main query) so the results are sharded in 
> the same way. 
> > * Perhaps add a second SQL statement to each of those attributes 
> (separated by a semi-colon), with :source set to :ranged_query, as covered 
> in the Sphinx documentation: 
> > http://sphinxsearch.com/docs/current.html#conf-sql-attr-multi 
> > 
> > The first of those isn’t too complex, so I’d start with that. Certainly 
> the second is far more fiddly, but may be worthwhile. 
> > 
> > Hope this helps! 
> > 
> > — 
> > Pat 
> > 
> >> On 29 Jun 2015, at 8:52 pm, [email protected] wrote: 
> >> 
> >> I even less understand the number of bytes in delta indexes 6 - 10. Why 
> does 1_delta contain 1128 bytes and 6_delta 24M? They're on the same 
> records. 
> >> 
> >> On Monday, June 29, 2015 at 9:03:04 AM UTC+3, [email protected] wrote: 
> >> Rails version: 4.1.7 
> >> TS version: 3.0.6 
> >> 
> >> On Monday, June 29, 2015 at 5:17:37 AM UTC+3, Pat Allan wrote: 
> >> Hi Jonathan 
> >> 
> >> Can you share your index definitions so I can get a better idea of 
> where the problem might be? 
> >> 
> >> Also: which versions of Rails and Thinking Sphinx are you using? 
> >> 
> >> — 
> >> Pat 
> >> 
> >>> On 28 Jun 2015, at 11:47 pm, [email protected] wrote: 
> >>> 
> >>> Hi Pat, 
> >>> 
> >>> I implemented according to this, and the indexing time went down (5 
> times faster on development). However, the delta indexing time went up (30 
> times slower on development). See below the indexing stats: 
> >>> 
> >>> Total docs        Bytes        Time 
> (sec)                                Total docs        Bytes        Time 
> (sec) 
> >>> 
> incident_index_1_core        7331        6531122        39.436                
>         incident_index_6_core        7331        28239593        8.802 
>
> >>> 
> incident_index_1_delta        6        1128        0.184                      
>   incident_index_6_delta        6        24763425        5.234 
>
> >>> 
> incident_index_2_core        7319        6751189        45.477                
>         incident_index_7_core        7319        28331726        8.819 
>
> >>> 
> incident_index_2_delta        5        843        0.233                       
>  incident_index_7_delta        5        24763289        5.321 
>
> >>> 
> incident_index_3_core        7390        6803814        42.064                
>         incident_index_8_core        7390        28310121        7.913 
>
> >>> 
> incident_index_3_delta        8        2143        0.203                      
>   incident_index_8_delta        8        24764366        5.282 
>
> >>> 
> incident_index_4_core        7278        6377664        37.665                
>         incident_index_9_core        7278        28162260        7.891 
>
> >>> 
> incident_index_4_delta        6        1108        0.436                      
>   incident_index_9_delta        6        24763330        5.456 
>
> >>> 
> incident_index_5_core        7396        6601358        39.704                
>         incident_index_10_core        7396        28152075        9.562 
>
> >>> 
> incident_index_5_delta        6        944        0.216                       
>  incident_index_10_delta        6        24763308        5.303 
>
> >>> 
> >>> Any idea why this is happening? 
> >>> 
> >>> Thanks, 
> >>> Jonathan 
> >>> 
> >>> On Friday, July 26, 2013 at 3:57:38 PM UTC+3, Pat Allan wrote: 
> >>> Heya Steve 
> >>> 
> >>> Was just looking into how difficult this would be to implement 
> properly, and noticed I have added the ability to take a string as the 
> source query - instead of the column references. So, it's possible without 
> hacking around in the index definition itself: 
> >>> 
> >>> https://gist.github.com/pat/6088629 
> >>> 
> >>> It's worth noting that the document id (Sphinx's equivalent of a 
> primary key) involves the normal primary key with an offset and a 
> multiplier. Make sure those two integers match what's in your generated 
> index in sql_query. They may change when you add other indices to your app 
> (depends on alphabetical order of your index files). 
> >>> 
> >>> Also: there's probably some metaprogramming you could add to simplify 
> things a bit more. 
> >>> 
> >>> Would love to hear if this approach helps with your real app and not 
> just the test one :) 
> >>> 
> >>> -- 
> >>> Pat 
> >>> 
> >>> On 26/07/2013, at 12:14 AM, Pat Allan wrote: 
> >>> 
> >>> > Hi Steve 
> >>> > 
> >>> > I've got a way forward to greatly improve the speed of indexing… 
> unfortunately, it's not going to work within Thinking Sphinx easily right 
> now. 
> >>> > 
> >>> > Sphinx has the ability to gather attribute and field values from 
> separate queries - this existed for TS v1/v2 for attributes, and fields was 
> added in TS v3, but the catch is those separate queries don't work for 
> HABTM joins. I'd love to change that, it's just painful from an 
> ActiveRecord perspective because you're not dealing with a model's table as 
> the base, but the HABTM join table. 
> >>> > 
> >>> > Here's the configuration for the relevant source that I modified by 
> hand: 
> >>> > https://gist.github.com/pat/6080031 
> >>> > 
> >>> > You'll see that the main query is nice and short - and then there's 
> each of the MVA and joined field definitions. If you put this in the 
> generated source definition in config/development.sphinx.conf, and then run 
> the indexer manually (NOT through the rake task, that'll overwrite this): 
> >>> >  indexer --config config/development.sphinx.conf --all --rotate 
> >>> > 
> >>> > (Remove --rotate if Sphinx isn't running.) You'll see it's pretty 
> damn fast. 
> >>> > 
> >>> > Now, ways forward? Well, I'd love to write something for TS v3 that 
> can handle HABTM - it's just a shame that it might need to be pure ARel 
> rather than ActiveRecord-built (which can otherwise help with joins). 
> >>> > 
> >>> > But otherwise: switch from HABTM to has_many/has_many :through - 
> make each of the joins an actual model. Then, you can add :source => :query 
> to each of the appropriate field and attribute definitions, and it should 
> generate something pretty much the same. 
> >>> > 
> >>> > Hope this provides some clarity at the very least! And also: thanks 
> for the test app, really helped with debugging! 
> >>> > 
> >>> > -- 
> >>> > Pat 
> >>> > 
> >>> > 
> >>> > On 25/07/2013, at 2:54 PM, Steve Kenworthy wrote: 
> >>> > 
> >>> >> Hi there, 
> >>> >> 
> >>> >> Firstly, thinking-sphinx is awesome and I love it. Thanks Pat for 
> an excellent project. V3 is looking great and represents a lot of hard work 
> and effort. 
> >>> >> 
> >>> >> I've been using thinking-sphinx to index a document model and it's 
> really slowed down when I add lots of associations in the index. In fact, 
> it never finishes on my machine (8Gig RAM, 8 CPU's) when I add 4 indexes. 
> >>> >> 
> >>> >> Times: 
> >>> >>         • 4 seconds - when 1 association (images) is indexed 
> >>> >>         • 6 seconds - when 2 associations (images and subscribers) 
> are indexed 
> >>> >>         • 23 seconds - when 2 associations (images and countries) 
> are indexed 
> >>> >>         • 115 seconds - when 3 associations (images, subscribers 
> and tags) are indexed 
> >>> >>         • 113 seconds - when 3 associations (images, subscribers 
> and videos) are indexed (just to prove it's not tags slowing it down) 
> >>> >>         • ꝏ (not finishing) - when 4 associations or more are 
> selected. 
> >>> >> 
> >>> >> Here's my index file: 
> >>> >> 
> >>> >> ThinkingSphinx::Index.define :document, with: :active_record, 
> delta: true, sql_range_step: 999999999, group_concat_max_len: 16384 do 
> >>> >> 
> >>> >>  has countries(:id), as: :country_ids 
> >>> >>  has images(:id), as: :image_ids, facet: true 
> >>> >>  has subscribers(:id), as: :subscriber_ids, facet: true 
> >>> >>  has tags(:id), as: :tag_ids, facet: true 
> >>> >>  has videos(:id), as: :video_ids, facet: true 
> >>> >> 
> >>> >>  indexes countries.name, as: :countries 
> >>> >>  indexes images.title, as: :images 
> >>> >>  indexes subscribers.title, as: :subscribers 
> >>> >>  indexes tags.name, as: :tags 
> >>> >>  indexes videos.title, as: :videos 
> >>> >> 
> >>> >>  has updated_at 
> >>> >> 
> >>> >> end 
> >>> >> 
> >>> >> The generated sql is a massive group_by query and is not finishing. 
> See it here 
> https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing 
> >>> >> 
> >>> >> I'd really appreciate some advice on how to optimise this so 
> indexing becomes viable again. Do I just have too much going on here? I'm 
> using facets, indexes and attributes. Perhaps there is a better way to 
> optimise? A friend suggested pre-computing with some joins... how would 
> this work? 
> >>> >> 
> >>> >> Vital stats: using mysql v14.14, sphinx 2.0.4, Ubuntu, rails 
> 3.2.13, thinking-sphinx 3.0.4 
> >>> >> 
> >>> >> For those who'd like to take a look, I've uploaded a sample project 
> here https://github.com/crossroads/rails3-ts-example which can be cloned. 
> If you follow the instructions, it will setup a db with test data and 
> reproduce the problem quickly. 
> >>> >> 
> >>> >> There's also the sphinx generated SQL and EXPLAIN: 
> https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing 
> >>> >> 
> >>> >> Thanks in advance for anyone taking the time to read. 
> >>> >> 
> >>> >> Regards, 
> >>> >> Steve 
> >>> >> 
> >>> >> -- 
> >>> >> You received this message because you are subscribed to the Google 
> Groups "Thinking Sphinx" group. 
> >>> >> To unsubscribe from this group and stop receiving emails from it, 
> send an email to [email protected]. 
> >>> >> To post to this group, send email to [email protected]. 
> >>> >> Visit this group at http://groups.google.com/group/thinking-sphinx. 
>
> >>> >> For more options, visit https://groups.google.com/groups/opt_out. 
> >>> >> 
> >>> >> 
> >>> > 
> >>> > 
> >>> > -- 
> >>> > You received this message because you are subscribed to the Google 
> Groups "Thinking Sphinx" group. 
> >>> > To unsubscribe from this group and stop receiving emails from it, 
> send an email to [email protected]. 
> >>> > To post to this group, send email to [email protected]. 
> >>> > Visit this group at http://groups.google.com/group/thinking-sphinx. 
> >>> > For more options, visit https://groups.google.com/groups/opt_out. 
> >>> > 
> >>> > 
> >>> 
> >>> 
> >>> 
> >>> -- 
> >>> You received this message because you are subscribed to the Google 
> Groups "Thinking Sphinx" group. 
> >>> To unsubscribe from this group and stop receiving emails from it, send 
> an email to [email protected] <javascript:>. 
> >>> To post to this group, send email to [email protected]. 
> >>> Visit this group at http://groups.google.com/group/thinking-sphinx. 
> >>> For more options, visit https://groups.google.com/d/optout. 
> >> 
> >> 
> >> -- 
> >> You received this message because you are subscribed to the Google 
> Groups "Thinking Sphinx" group. 
> >> To unsubscribe from this group and stop receiving emails from it, send 
> an email to [email protected]. 
> >> To post to this group, send email to [email protected]. 
> >> Visit this group at http://groups.google.com/group/thinking-sphinx. 
> >> For more options, visit https://groups.google.com/d/optout. 
> > 
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups "Thinking Sphinx" group. 
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to [email protected] <javascript:>. 
> > To post to this group, send email to [email protected] 
> <javascript:>. 
> > Visit this group at http://groups.google.com/group/thinking-sphinx. 
> > For more options, visit https://groups.google.com/d/optout. 
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Thinking Sphinx" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/thinking-sphinx.
For more options, visit https://groups.google.com/d/optout.

Re: [ts] Slow indexing (never finishes) when indexing 4 or more associations

Reply via email to