It’s broken again…

> On Apr 3, 2026, at 12:18 PM, Nicholas Chammas <[email protected]> 
> wrote:
> 
> Thanks for fixing this. I can confirm it’s working from my side.
> 
> Looks like we need some kind of alert on Algolia's crawl status 
> <https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status>. If 
> there’s a way a non-committer can help with this, let me know.
> 
> 
>> On Apr 3, 2026, at 1:39 AM, Gengliang Wang <[email protected]> wrote:
>> 
>> Hi Nicholas,
>> 
>> The crawler configuration was not updated after the Spark 4.1.1 release, as 
>> documented in the release process 
>> <https://spark.apache.org/release-process.html>. I've fixed it.
>> 
>> A unit test isn't really feasible here since the doc search is powered by 
>> Algolia, but we could set up an Algolia monitoring alert to catch this 
>> proactively. I'll look into it when I have the bandwidth.
>> 
>> Gengliang
>> 
>> On Wed, Apr 1, 2026 at 3:09 PM Nicholas Chammas <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> It’s broken again. This is the third breakage I am reporting in the past 
>>> couple of years.
>>> 
>>> Is there some sort of alert or CI test we could set up to catch or prevent 
>>> this going forward?
>>> 
>>> 
>>>> On Dec 21, 2025, at 1:35 PM, Gengliang Wang <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> 
>>>> The crawler issue has been identified and fixed.
>>>> 
>>>> The root cause was that, by default, the crawler fails when the latest 
>>>> result contains fewer than 90% of the records in the previous result. 
>>>> Increasing the `maxLostRecordsPercentage` threshold resolves the issue.
>>>> 
>>>> https://www.algolia.com/doc/tools/crawler/apis/configuration/safety-checks
>>>> 
>>>> 
>>>> On Wed, Dec 17, 2025 at 10:03 PM Xiao Li <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>>> Thanks for reporting it! Will take a look
>>>>> 
>>>>> On Fri, Dec 5, 2025 at 4:19 AM Nicholas Chammas <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>>> Bueller?
>>>>>> 
>>>>>> Is anyone on this list able to fix the crawler?
>>>>>> 
>>>>>> 
>>>>>>> On Dec 1, 2025, at 12:19 PM, Nicholas Chammas 
>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> This seems to be happening again.
>>>>>>> 
>>>>>>> Perhaps we should add a new test (but where, I wonder?) to ensure that 
>>>>>>> Algolia search doesn’t break without us knowing.
>>>>>>> 
>>>>>>> Nick
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 11, 2023, at 5:02 AM, Gengliang Wang <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> Hi Nick,
>>>>>>>> 
>>>>>>>> Thank you for reporting the issue with our web crawler.
>>>>>>>> 
>>>>>>>> I've found that the issue was due to a change (specifically, pull 
>>>>>>>> request #40269 <https://github.com/apache/spark/pull/40269>) in the 
>>>>>>>> website's HTML structure, where the JavaScript selector 
>>>>>>>> ".container-wrapper" is now ".container". I've updated the crawler 
>>>>>>>> accordingly, and it's working properly now.
>>>>>>>> 
>>>>>>>> Gengliang
>>>>>>>> 
>>>>>>>> On Sun, Dec 10, 2023 at 8:15 AM Nicholas Chammas 
>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>> Pinging Gengliang and Xiao about this, per these docs 
>>>>>>>>> <https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>.
>>>>>>>>> 
>>>>>>>>> It looks like to fix this problem you need access to the Algolia 
>>>>>>>>> Crawler Admin Console.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas 
>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Should I report this instead on Jira? Apologies if the dev list is 
>>>>>>>>>> not the right place.
>>>>>>>>>> 
>>>>>>>>>> Search on the website appears to be broken. For example, here is a 
>>>>>>>>>> search for “analyze”:
>>>>>>>>>> 
>>>>>>>>>> <Image 12-5-23 at 11.26 AM.jpeg>
>>>>>>>>>> 
>>>>>>>>>> And here is the same search using DDG 
>>>>>>>>>> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze&t=osx&ia=web>.
>>>>>>>>>> 
>>>>>>>>>> Nick
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
> 
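For anyone picking up the alert idea discussed in this thread, here is a rough sketch of the two checks it mentions: the lost-records threshold behind `maxLostRecordsPercentage`, and a basic "does search return anything" smoke check. This is only an illustration of the arithmetic, not Algolia's implementation. The function names are hypothetical, the 10% default is inferred from the "less than 90%" description above, and the smoke check assumes the search response exposes an integer `nbHits` field.

```python
from typing import Any, Mapping


def lost_records_percentage(previous_count: int, latest_count: int) -> float:
    """Percentage of records lost between two crawls (0.0 if none were lost)."""
    if previous_count <= 0:
        # Nothing to compare against, so nothing can have been lost.
        return 0.0
    lost = max(previous_count - latest_count, 0)
    return 100.0 * lost / previous_count


def crawl_passes_safety_check(
    previous_count: int,
    latest_count: int,
    max_lost_percentage: float = 10.0,  # assumed default, per the thread
) -> bool:
    """Hypothetical model of the threshold: pass if losses stay within the limit."""
    return lost_records_percentage(previous_count, latest_count) <= max_lost_percentage


def search_looks_healthy(response: Mapping[str, Any], min_hits: int = 1) -> bool:
    """True if a search response reports at least `min_hits` results.

    Assumes the response carries an integer `nbHits` field.
    """
    nb_hits = response.get("nbHits", 0)
    return isinstance(nb_hits, int) and nb_hits >= min_hits
```

A scheduled job could run a known-good query (for example "analyze", from the original report) against the docs index, pass the parsed response to `search_looks_healthy`, and notify someone when it returns `False`.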
