[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

Stefan Brüns changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
      Latest Commit|                            |https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223
             Status|ASSIGNED                    |RESOLVED

--- Comment #17 from Stefan Brüns ---
Git commit d97f3f832f31a89f5ca4ee058043003bc1474223 by Stefan Brüns.
Committed on 14/07/2025 at 12:13.
Pushed by bruns into branch 'master'.

[TermGenerator] Check input text validity

In case the supplied text contains invalid surrogates (i.e. single low
surrogates or without preceding high surrogate), the text is not valid
unicode. This can also cause QString::toUtf8() to return an empty
QByteArray.

Related: bug 506570

M  +43   -0    autotests/unit/engine/termgeneratortest.cpp
M  +12   -3    src/engine/termgenerator.cpp
A  +50   -0    src/engine/termgenerator_p.h     [License: LGPL(v2.1+)]

https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223

-- 
You are receiving this mail because:
You are watching all bug changes.
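The validity condition named in the commit message above (no unpaired surrogates) can be illustrated without Qt. The following is only a minimal sketch of such a check, not the code from the commit: the function name `hasOnlyPairedSurrogates` is hypothetical, and it works on a plain `std::u16string` rather than a `QString`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not taken from the Baloo commit.
// Returns true if every surrogate code unit in `text` belongs to a
// well-formed pair (high surrogate immediately followed by low surrogate).
bool hasOnlyPairedSurrogates(const std::u16string &text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;

        if (isLow) {
            // A low surrogate reached here has no preceding high surrogate.
            return false;
        }
        if (isHigh) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= text.size()) {
                return false;
            }
            const char16_t next = text[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) {
                return false;
            }
            ++i; // skip the low half of the pair
        }
    }
    return true;
}
```

A caller could use a check like this to skip a term before it is converted to UTF-8, so that an empty byte array never reaches the code that asserts on empty terms.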
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

Bug Janitor Service changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CONFIRMED                   |ASSIGNED

--- Comment #16 from Bug Janitor Service ---
A possibly relevant merge request was started @
https://invent.kde.org/frameworks/baloo/-/merge_requests/241
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #15 from Stefan Brüns ---
Git commit 9fa1aaaf4a841224161e791cb8ffd366485dc7e3 by Stefan Brüns.
Committed on 06/07/2025 at 18:16.
Pushed by bruns into branch 'master'.

[PlaintextExtractor] Fix various issues with UTF-16

Read the file in binary mode, feed the complete data into QStringDecoder
with the detected encoding, and split the lines last.

Opening a file with open mode "QIODevice::Text" mangles Carriage Return
sequences, and the UTF16-LE sequence "\r\0\n\0" ends up as "\0\n\0", i.e.
an invalid sequence.

QIODevice::readline() only supports 8 bit encodings (see QTBUG 121812),
and the fixup attempts here were not working in general.

Unfortunately, QTextStream::setEncoding only supports UTF encodings, but
none of the legacy ISO-8859 or Windows encodings or e.g. GB18030.

M  +0    -2    autotests/indexerextractortests.cpp
M  +53   -25   src/extractors/plaintextextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/9fa1aaaf4a841224161e791cb8ffd366485dc7e3
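The effect described in this commit message can be reproduced on a few bytes without Qt. The sketch below is an illustration under stated assumptions, not the extractor code: all function names are hypothetical, and `dropCarriageReturnBytes` is a deliberately simplified stand-in for text-mode translation, chosen only because it reproduces the "\r\0\n\0" to "\0\n\0" mangling quoted above. It contrasts that with the order the commit describes: decode the complete binary data first, split into lines last.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Simplified model (assumption) of byte-level newline translation:
// every 0x0D byte is removed, regardless of the text encoding.
Bytes dropCarriageReturnBytes(const Bytes &input)
{
    Bytes out;
    for (unsigned char b : input) {
        if (b != 0x0D) {
            out.push_back(b);
        }
    }
    return out;
}

// Decode little-endian UTF-16 code units; a trailing odd byte is ignored.
std::u16string decodeUtf16Le(const Bytes &input)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < input.size(); i += 2) {
        out.push_back(static_cast<char16_t>(input[i] | (input[i + 1] << 8)));
    }
    return out;
}

// Split already-decoded text on '\n' and strip one trailing '\r' per line.
// Text ending in '\n' yields a final empty line.
std::vector<std::u16string> splitLines(const std::u16string &text)
{
    std::vector<std::u16string> lines;
    std::size_t start = 0;
    while (start <= text.size()) {
        std::size_t end = text.find(u'\n', start);
        if (end == std::u16string::npos) {
            end = text.size();
        }
        std::u16string line = text.substr(start, end - start);
        if (!line.empty() && line.back() == u'\r') {
            line.pop_back();
        }
        lines.push_back(line);
        start = end + 1;
    }
    return lines;
}
```

With the UTF-16LE bytes for "A\r\nB" (41 00 0D 00 0A 00 42 00), decoding first and splitting afterwards gives the lines "A" and "B". Removing the 0x0D byte before decoding leaves seven bytes, so every following code unit is shifted by one byte and the decoder sees 0x0041, 0x0A00, 0x4200 plus a stray byte.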
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #14 from [email protected] ---
There's a fix for the UTF-16 issue here:

    https://invent.kde.org/frameworks/kfilemetadata/-/merge_requests/193

Thank you Stefan!

That's just landed on Neon Unstable. I don't know how long "due course" is,
but if it's on Neon Unstable it will arrive on Neon User in "due course" :-)

This doesn't address Bug 506570 (a binary file that says it's UTF-32); that
seems to be a different issue.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

[email protected] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #13 from [email protected] ---
*** Bug 506570 has been marked as a duplicate of this bug. ***
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

[email protected] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #12 from [email protected] ---
*** Bug 506608 has been marked as a duplicate of this bug. ***
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #11 from [email protected] ---
*** Bug 506598 has been marked as a duplicate of this bug. ***
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #10 from [email protected] ---
(In reply to Stefan Brüns from comment #9)
> This seems to be a cascade of bugs/implementation errors, finally triggering
> the assert.

I find a couple of things confusing:

* This has suddenly started happening, with several very similar bugs.
* All appear on Neon User, the test case we have works on Neon Testing and
  Unstable.

> - The KFileMetaData plaintext extractor uses QIODevice::readline, although
>   this is not supported for 16bit encodings (see
>   https://bugreports.qt.io/browse/QTBUG-121812)
> - The split code returns a term QString which only contains invalid unicode
>   code points
> - QString::toUtf8() returns an empty QByteArray

That would explain why if you convert the file to UTF-8 with iconv, Baloo is
happy.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

Stefan Brüns changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #9 from Stefan Brüns ---
This seems to be a cascade of bugs/implementation errors, finally triggering
the assert.

- The KFileMetaData plaintext extractor uses QIODevice::readline, although
  this is not supported for 16bit encodings (see
  https://bugreports.qt.io/browse/QTBUG-121812)
- The split code returns a term QString which only contains invalid unicode
  code points
- QString::toUtf8() returns an empty QByteArray
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

toni_rocha changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

[email protected] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #8 from [email protected] ---
*** Bug 506516 has been marked as a duplicate of this bug. ***
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #7 from [email protected] ---
To tidy the UTF-16 loose end: after converting the file from UTF-16 to UTF-8
with

    $ iconv -f UTF-16 -t UTF-8 home.csv > home2.csv

Baloo can read and index it.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

[email protected] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|REPORTED                    |CONFIRMED

--- Comment #6 from [email protected] ---
(In reply to Garirry from comment #5)
> No I don't use CJK generally. Although speaking of that, I do know that many
> if not all of the files that are affected, if opened in an editor like
> KWrite display CJK characters, and it detects an encoding of UTF-16.

If I look at your uploaded "home.csv" (in LibreOffice Calc) it looks like a
set of translations - 10 languages that include Japanese and Chinese scripts
(plus English, German, French etc, etc and etc)

(In reply to tagwerk19 from comment #4)
> I don't get a crash ...

I've just tried a clean install of Neon User. I now see a crash.

> ... a completely different set of plain text terms on a Neon User (dodgy) ...

Best discard that result, it was on a system with a custom locale (it had
LC_TIME=en_SE.UTF-8 to get ISO format short dates - maybe that's too weird...)

So, I can flag "Confirmed" but don't really know where it goes from here (on
the basis that I don't get the crash on Neon Unstable or Neon Testing).

Summarising what I see...

    Neon User:
        Plasma: 6.4.1
        Frameworks: 6.15.0
        Qt: 6.9.0
        Wayland
        Crashes

    Neon Testing:
        Plasma: 6.4.1
        Frameworks: 6.16.0
        Qt: 6.9.0
        Wayland
        Seems OK

    Neon Unstable:
        Plasma: 6.4.80
        Frameworks: 6.16.0
        Qt: 6.9.0
        Wayland
        Seems OK
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #5 from Garirry ---
(In reply to tagwerk19 from comment #4)
> I'm afraid don't really have an idea here... You are using CJK - Chinese? I
> apologise for not being familiar.

No I don't use CJK generally. Although speaking of that, I do know that many
if not all of the files that are affected, if opened in an editor like KWrite
display CJK characters, and it detects an encoding of UTF-16.

If that would help I could upload more file samples.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #4 from [email protected] ---
(In reply to Garirry from comment #3)
> The files which cause the crash do so consistently, if I don't exclude them
> then baloo scans them again on system boot and crashes for each file.

I'm afraid I don't really have an idea here... You are using CJK - Chinese? I
apologise for not being familiar.

I don't get a crash, but I do find that if I index the file and check with
"balooshow6 -x home.csv", I get a completely different set of plain text
terms on a Neon User (dodgy) compared to a Neon Unstable (more sensible).

As a marker, we've also had a recent Bug 505968 where there is some strange
behaviour with CJK.

    https://bugs.kde.org/show_bug.cgi?id=505968#c2
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #3 from Garirry ---
(In reply to tagwerk19 from comment #2)
> Does the same happen if you have the file in a folder of its own and just
> index that folder? (You can close down baloo and rename the
> .local/share/baloo/index file to keep it save)

Yes, the exact same error occurs.

The files which cause the crash do so consistently, if I don't exclude them
then baloo scans them again on system boot and crashes for each file.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

[email protected] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #2 from [email protected] ---
(In reply to Garirry from comment #0)
> #13 0x750c0a24e1f9 n/a (kfilemetadata_plaintextextractor.so + 0x31f9)

If I run the same file on a more-or-less scratch system (Neon Unstable), I see
Baloo deciding to use the plain text extractor (using inherited mimetype...)
and then successfully indexing the file. It is an empty index though.

> ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23

OK... that's fairly clear.

Does the same happen if you have the file in a folder of its own and just
index that folder? (You can close down baloo and rename the
.local/share/baloo/index file to keep it safe)
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

--- Comment #1 from Garirry ---
After further rebuilding the entire index, I can now add that those .csv files
are not specifically the culprit, as there are many more that cause the exact
same type of crash. The only thing that they have in common is that they all
have text encoded as UTF-16.
[frameworks-baloo] [Bug 506187] baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
https://bugs.kde.org/show_bug.cgi?id=506187

Garirry changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|baloo_file_extractor        |baloo_file_extractor
                   |crashes on attempting to    |crashes on attempting to
                   |index specific CSV files    |index specific files and
                   |and spams tray              |spams tray notifications
                   |notifications               |
