If you’ve got tag names and their corresponding ids, I think it’d be better 
(and more accurate) to query Sphinx by the ids:

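  # in the index (real-time indices need an explicit attribute type):
  has tag_ids, :type => :integer, :multi => true

  # when searching, maybe something like:
  tag = Gutentag::Tag.find_by(name: params[:tag_name])
  Document.search params[:query], :with => {:tag_ids => tag.id} if tag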
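
To show why filtering on ids is exact, here is a rough plain-Ruby sketch. It
makes no Sphinx or Thinking Sphinx calls; Doc, DOCS and filter_by_tag_id are
made-up names, and the ids and titles are invented sample data. The point is
only that an attribute filter compares integers, so the characters in a tag's
name are never tokenised:

```ruby
# Plain-Ruby illustration only: nothing here talks to Sphinx.
# Doc, DOCS and filter_by_tag_id are hypothetical names for this sketch.
Doc = Struct.new(:id, :title, :tag_ids)

DOCS = [
  Doc.new(1, "Hashtags on Twitter", [10, 11]),
  Doc.new(2, "Testing in Rails",    [11]),
  Doc.new(3, "Untagged page",       [])
]

# Mimics an attribute filter: a document matches when its list of tag ids
# contains the requested id. No text is tokenised, so punctuation in the
# tag's name never comes into play.
def filter_by_tag_id(docs, tag_id)
  docs.select { |doc| doc.tag_ids.include?(tag_id) }
end

hashtag_id = 10 # stands in for the id of a tag named "#test"
puts filter_by_tag_id(DOCS, hashtag_id).map(&:id).inspect
```

An id with no matching documents simply gives an empty list, which is the
same outcome as a filter that matches nothing.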

It doesn’t answer the question of why octothorps aren’t being indexed/searched 
correctly, but it should give better search results generally.
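
For reference, the charset_table behaviour reported in the quoted local tests
can be mimicked in plain Ruby. This is a simplified model, not how Sphinx is
implemented: tokenize, matches?, DEFAULT_CHARS and WITH_HASH are made-up names,
and the two character classes are only rough stand-ins for the default charset
and the one with U+23 added.

```ruby
# Plain-Ruby illustration only; Sphinx's real tokeniser does far more.
# tokenize, matches?, DEFAULT_CHARS and WITH_HASH are hypothetical names.
DEFAULT_CHARS = /[0-9a-z_]/   # rough stand-in for the default charset_table
WITH_HASH     = /[0-9a-z_#]/  # the same, with U+23 added

# Lower-case the text, then treat every character outside the charset as a
# word separator.
def tokenize(text, charset)
  text.downcase.chars.map { |c| c.match?(charset) ? c : " " }.join.split
end

# A query matches when every one of its words appears in the document.
def matches?(document, query, charset)
  (tokenize(query, charset) - tokenize(document, charset)).empty?
end

document = "notes about #test"

puts matches?(document, "#test", DEFAULT_CHARS) # "#" is dropped on both sides
puts matches?(document, "test",  DEFAULT_CHARS)
puts matches?(document, "#test", WITH_HASH)     # "#test" is indexed as one word
puts matches?(document, "test",  WITH_HASH)     # "test" alone is not indexed
```

In this model the first three searches match and the last one does not, which
lines up with the results described for "#test" and "test" with and without
the custom charset_table.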

Cheers,

— 
Pat

> On 24 Feb 2021, at 1:48 am, Walter Lee Davis <wa...@wdstudio.com> wrote:
> 
> 
> 
>> On Feb 23, 2021, at 12:02 AM, Pat Allan <p...@freelancing-gods.com> wrote:
>> 
>> Having the setting in the default block should be fine - you should be able 
>> to see the charset_table setting in the generated Sphinx configuration files.
>> 
>> Also: I generally recommend just using ts:rebuild, as that handles both 
>> real-time indices and SQL-backed indices (i.e. it’s running the same things 
>> as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for 
>> you, I’m keen to hear why!
> 
> While I was fighting with this, and fiddling with the configuration to use 
> has instead of indexes, I got myself into a state where ts:rebuild would blow 
> up with a SQL error (I think it was a Sphinx SQL error) and ts:rt:rebuild 
> would work fine. But with the current configuration that I shared with you, 
> both work.
> 
>> 
>> All that said, doesn’t sound like you’re doing anything wrong. I wonder if 
>> html_strip is somehow filtering out the octothorps? Though I’m pretty sure 
>> it’s looking just for HTML tags… still, may be worth turning that off to 
>> double-check.
>> 
>> And I’ve just run some quick tests locally - without the custom 
>> charset_table value, I find the string “#test” is found by Sphinx when 
>> searching by “#test” or “test” (because # is ignored, given it’s not an 
>> indexable character - so the two searches are actually identical). Adding in 
>> the charset_table setting, rebuilding - searching for #test returns a 
>> result, but test doesn’t (as that now doesn’t exist as a standalone word in 
>> what’s indexed).
>> 
>> I doubt it matters, but: which version of Sphinx are you using?
> 
> Sphinx 2.2.11-id64-release (95ae9a6), TS 5.0.0.
> 
> It's definitely odd. I'm not sure if re-indexing is picking up the tag names 
> when it runs en masse, and it seems to be something with GutenTag. If I find 
> a document in console, the object that I get back has tag_names set to nil, 
> but if I then call tag_names on that object, I get back the array of strings 
> I am expecting. It's just the value that I see inside the <> brackets 
> initially when to_s is called on the found object by irb, so I don't know if 
> that's significant at all, or is getting in the way of Sphinx extracting the 
> values. Again, when I test in console by calling my tags_for_indexing method 
> on a found object, I get back the expected string value.
> 
> I've told the client that she may need to get rid of her beloved hashtags in 
> the tagging interface, or use Gutentag in place of Sphinx to get "everything 
> tagged with this tag". I'm not convinced that's a bad idea, either.
> 
> Walter
> 
>> 
>> — 
>> Pat
>> 
>>> On 23 Feb 2021, at 3:10 pm, Walter Lee Davis <wa...@wdstudio.com> wrote:
>>> 
>>> Thanks for the speedy reply. I tried adding the charset table as 
>>> recommended, but I am not seeing any difference in my search results. I did 
>>> differ from the directions slightly, in that I put the character set in the 
>>> default block at the top of my Yaml file, since it's then included in all 
>>> of the environments. I figured that should work, but in case it doesn't can 
>>> you explain why?
>>> 
>>> default: &default
>>>   morphology: stem_en
>>>   html_strip: true
>>>   batch_size: 300
>>>   charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+23"
>>> 
>>> development:
>>>   <<: *default
>>> 
>>> test:
>>>   <<: *default
>>> 
>>> production:
>>>   <<: *default
>>> 
>>> staging:
>>>   <<: *default
>>>   mysql41: 9320
>>> 
>>> 
>>> I forced a full rebuild/reindex with rake ts:rt:rebuild. When that didn't 
>>> seem to change things, I also ran rake ts:rebuild. My understanding is that 
>>> the first of these should be done when you use the Real Time index. If I'm 
>>> mistaken, please let me know.
>>> 
>>> Thanks again!
>>> 
>>> Walter
>>> 
>>>> On Feb 22, 2021, at 10:51 PM, Pat Allan <p...@freelancing-gods.com> wrote:
>>>> 
>>>> Hi Walter,
>>>> 
>>>> I’m pretty sure Sphinx doesn’t index punctuation by default. If you want 
>>>> octothorps included, you’ll need to define a custom charset_table value 
>>>> (per environment in `config/thinking_sphinx.yml`) which includes that 
>>>> character. The Sphinx docs outline the default, so best to take that and 
>>>> then add in the octothorp (U+23).
>>>> http://sphinxsearch.com/docs/current.html#conf-charset-table
>>>> https://freelancing-gods.com/thinking-sphinx/v5/advanced_config.html#character-sets-and-tables
>>>> 
>>>> Keep in mind that this will impact all uses of that character in all 
>>>> fields - there’s no way to have it apply to just some fields (or, in this 
>>>> case, words that only start with that character).
>>>> 
>>>> Once you’ve added this configuration, a full rebuild will be required.
>>>> 
>>>> Cheers,
>>>> 
>>>> — 
>>>> Pat
>>>> 
>>>>> On 23 Feb 2021, at 2:41 pm, Walter Lee Davis <wa...@wdstudio.com> wrote:
>>>>> 
>>>>> I'm using GutenTag to apply tags to individual pages in a CMS. The 
>>>>> Document model uses TS5 with Real-Time Indexing. I've set up my index 
>>>>> thusly:
>>>>> 
>>>>> # in the model
>>>>> def tags_for_indexing
>>>>>   tag_names.join ' '
>>>>> end
>>>>> 
>>>>> # in the index
>>>>> ThinkingSphinx::Index.define :document, :with => :real_time do
>>>>>   scope { Document.where(id: Document.publicly.map{ |d| [d.id].concat(d.descendants.published.map(&:id)) }.flatten) }
>>>>> 
>>>>>   indexes title
>>>>>   indexes teaser
>>>>>   indexes body_html
>>>>>   indexes author_display
>>>>>   indexes tags_for_indexing
>>>>> 
>>>>>   has created_at, type: :timestamp
>>>>>   has updated_at, type: :timestamp
>>>>> end
>>>>> 
>>>>> I've tested the method, and confirm that it outputs a space-delimited 
>>>>> string of words for the tags.
>>>>> 
>>>>> I run rake ts:rt:rebuild and everything seems to go fine. But trying to 
>>>>> search on some of these tag names is not returning the results I am 
>>>>> imagining. The client has insisted on making some of these tags start 
>>>>> with an octothorp, because she is writing about "hashtags" on Twitter. 
>>>>> Most tags do not have punctuation in them. I am able to find other terms, 
>>>>> even very obscure ones, when I don't use punctuation in the tag names. 
>>>>> 
>>>>> Does this sound like something that I can fix, or should I advise the 
>>>>> client to lay off the octothorps?
>>>>> 
>>>>> Walter
>>>>> 
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google Groups 
>>>>> "Thinking Sphinx" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>>>> email to thinking-sphinx+unsubscr...@googlegroups.com.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/thinking-sphinx/EA71574B-9EBF-484E-A5FA-BF7CD53A10BC%40wdstudio.com.
>>>> 
>>>> 
>>> 
>> 
> 

