[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #11 from David Cook ---
Created attachment 176768
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176768&action=edit
Bug 38913: (QA follow-up) test UTF-8 exceptions in large MARC records

MARC records over 99999 bytes are invalid per the spec, and when you use UTF-8 encoded characters in your MARC records, there is the potential to generate fatal errors in MARC::File::USMARC when it runs "marc_to_utf8" from "MARC::File::Encode" during its "decode" operation.

That is, if you MARC::File::USMARC->encode a MARC record of over 99999 bytes (including a number of UTF-8 multibyte characters), there is the potential that running MARC::File::USMARC->decode on that same data will generate a fatal exception.

The main patch in bug 38913 wraps the function doing the decode, so that a bad record doesn't crash processing. Without the patch, this unit test will fail. With the patch, this unit test will pass.

--
You are receiving this mail because:
You are watching all bug changes.
___
Koha-bugs mailing list
Koha-bugs@lists.koha-community.org
https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-bugs
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
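The encode-then-decode failure described above can be illustrated outside MARC entirely: when fixed-width length bookkeeping goes wrong, the decoder ends up slicing a byte string in the middle of a multibyte UTF-8 sequence, and the resulting bytes no longer form valid Unicode. A minimal, hypothetical Python sketch (Koha's actual code is Perl; this only demonstrates the byte-level mechanics):

```python
# Sketch: slicing a byte string mid-sequence breaks UTF-8 decoding,
# while pure single-byte ASCII survives any slice point.
ascii_data = "large record".encode("utf-8")
utf8_data = "zażółć 中文".encode("utf-8")  # contains multibyte characters

# Any prefix of pure ASCII decodes fine:
assert ascii_data[:5].decode("utf-8") == "large"

# But a prefix that cuts into the two-byte "ż" raises an error,
# conceptually similar to 'UTF-8 "\x85" does not map to Unicode'
# from MARC::File::Encode:
truncated = utf8_data[:3]
try:
    truncated.decode("utf-8")
    raise AssertionError("expected a decode error")
except UnicodeDecodeError as e:
    print("decode failed as expected:", e.reason)
```

This also matches the observation in comment #9: a test record built only from single-byte ASCII cannot trigger this class of failure, no matter how badly the positional math misbehaves.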
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #10 from David Cook ---
(In reply to David Cook from comment #9)
> Let's see if I can break the unit test in main...

I added a bunch of Chinese and some of the Polish from the test record, and I couldn't get the unit tests to break in main. I was about to give up... when I tried again, and I managed to get the following:

kohadev-koha@kohadevbox:koha(main)$ prove t/db_dependent/Koha/SearchEngine/Elasticsearch.t
t/db_dependent/Koha/SearchEngine/Elasticsearch.t .. 1/8
# Looks like you planned 70 tests but ran 55.
# Failed test 'Koha::SearchEngine::Elasticsearch::marc_records_to_documents() tests'
# at t/db_dependent/Koha/SearchEngine/Elasticsearch.t line 805.
UTF-8 "\x99" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
# Looks like your test exited with 11 just after 4.

Hurray!
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #9 from David Cook ---
Something is still bugging me...

t/db_dependent/Koha/SearchEngine/Elasticsearch.t

In theory, that test script has a test for a "large MARC record" which runs marc_records_to_documents(), and it doesn't fail on "main". It must be longer than 99999 bytes, since it's switching to MARCXML from base64ISO2709. So we did test for large MARC records on bug 38416.

But... since the fatal error Janusz is fixing comes from MARC::File::USMARC not handling an exception during marc_to_utf8(), it must also be because being over 99999 bytes creates an invalid directory in the USMARC data. It then starts doing string handling using the positional math. And then it needs to get the right combination of invalid bytes.

Since the unit test was just using single-byte ASCII... it should be impossible to generate invalid UTF-8 in this particular test scenario.

Let's see if I can break the unit test in main...
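The "positional math" mentioned above comes from the ISO 2709 layout: the leader stores the record length as 5 ASCII digits, and each directory entry is a fixed 12 bytes (3-digit tag, 4-digit field length, 5-digit field start offset). Once a field starts beyond offset 99999, its 5-digit slot overflows and every entry after it is misaligned. A hypothetical Python sketch of the packing (not MARC::File::USMARC's actual code):

```python
# Sketch of ISO 2709 directory entry packing: tag (3) + length (4) + start (5).
def directory_entry(tag: str, length: int, start: int) -> str:
    # The format reserves exactly 4 digits for the field length and
    # 5 digits for the field's start offset within the record.
    return f"{tag:0>3}{length:04d}{start:05d}"

ok = directory_entry("245", 57, 1234)
assert len(ok) == 12  # well-formed: fixed 12-byte entry

# In an oversized record a field starts past offset 99999; the entry
# silently grows to 13 bytes, so every later entry (and the positional
# string handling that follows) reads from the wrong offsets.
bad = directory_entry("952", 120, 123456)
print(len(bad))  # 13 — downstream slicing now lands on arbitrary bytes
```

Slicing at arbitrary byte offsets is harmless with single-byte ASCII data, but with multibyte UTF-8 it can land mid-character, which is exactly the combination needed to trigger the fatal marc_to_utf8() error.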
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook changed:

  Status: Signed Off → Passed QA
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook changed:

  Attachment #176704 is obsolete: 0 → 1

--- Comment #8 from David Cook ---
Created attachment 176767
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176767&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized records

After Bug 38416, Elasticsearch indexing explodes with oversized records, especially with UTF-8 encoded data. In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the following snippet was introduced:

  my $usmarc_record = $record->as_usmarc();
  my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a MARC::Record object, but not for $record->as_usmarc. The produced ISO 2709 string is not correct and hence cannot be properly converted back to a MARC::Record object by new_from_usmarc. The result in this case can be like:

  UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.

Since this is done without any eval / try, the whole reindex procedure (for instance rebuild_elasticsearch.pl) gets randomly interrupted with no explanation.

Test plan:
==========
Hard to reproduce, but the explanation, together with the discussion in Bug 38416 (from 2024-12-15), explains and justifies the need for this added eval.
1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with:
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience). During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface with http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439 that the record has been imported. However, you should not be able to make a search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now and you should be able to find the record.

Signed-off-by: Magnus Enger
Followed the test plan. Works as advertised.
Signed-off-by: David Cook
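The patch's approach is to wrap the risky decode so that one bad record is warned about and skipped instead of aborting the whole reindex run. A language-neutral sketch of that pattern in Python (the actual change is a Perl eval around MARC::Record->new_from_usmarc; the names below are hypothetical):

```python
# Sketch of the "wrap the risky decode, warn and skip" pattern
# (hypothetical names; the real fix is a Perl eval in Koha).
import warnings

def decode_or_skip(raw: bytes):
    """Return the decoded record, or None if decoding blows up."""
    try:
        return raw.decode("utf-8")  # stand-in for new_from_usmarc()
    except UnicodeDecodeError as e:
        warnings.warn(f"skipping bad record: {e.reason}")
        return None

records = [b"good record", b"bad \xc4", b"another good one"]
indexed = [r for r in (decode_or_skip(raw) for raw in records) if r is not None]
print(len(indexed))  # 2 — the reindex run survives the bad record
```

The design choice mirrors the bug report: without the guard, a single malformed ISO 2709 string kills rebuild_elasticsearch.pl mid-run with no explanation; with it, the failure is contained to the one record.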
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook changed:

  Severity: major → blocker
  QA Contact: testo...@bugs.koha-community.org → dc...@prosentient.com.au

--- Comment #7 from David Cook ---
While my ktd was downloading, I looked at the code and warnings more, and I see the logic here now. I must've been so focused on bug 38416 on fields with more than 9999 bytes that we forgot to test records with more than 99999 bytes. The MARC::* modules have some interesting quirks. MARC::File::USMARC could use some attention...

--

So this patch is good. It fixes the problem. Thanks a lot, Janusz, for following up on this one. I'll mark it QAed and I'll increase the importance.
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #6 from David Cook ---
Guessing I must need to update my koha-testing-docker, because it's failing with a missing MARC/Lint.pm module...
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #5 from David Cook ---
Without the patch, I tried to use the web UI to import this file, but it failed to load. In /var/log/koha/kohadev/worker-output.log I see the following:

Record length of 527856 is larger than the MARC spec allows (99999 bytes). at /usr/share/perl5/MARC/File/USMARC.pm line 314.
UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.

--

I applied the patch, ran 'restart_all', and tried the web UI again. And it's still failing to import.

--

I'll try the test plan you've provided...
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Magnus Enger changed:

  Status: Needs Signoff → Signed Off
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Magnus Enger changed:

  Attachment #176701 is obsolete: 0 → 1

--- Comment #4 from Magnus Enger ---
Created attachment 176704
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176704&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized records

After Bug 38416, Elasticsearch indexing explodes with oversized records, especially with UTF-8 encoded data. In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the following snippet was introduced:

  my $usmarc_record = $record->as_usmarc();
  my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a MARC::Record object, but not for $record->as_usmarc. The produced ISO 2709 string is not correct and hence cannot be properly converted back to a MARC::Record object by new_from_usmarc. The result in this case can be like:

  UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.

Since this is done without any eval / try, the whole reindex procedure (for instance rebuild_elasticsearch.pl) gets randomly interrupted with no explanation.

Test plan:
==========
Hard to reproduce, but the explanation, together with the discussion in Bug 38416 (from 2024-12-15), explains and justifies the need for this added eval.
1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with:
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience). During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface with http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439 that the record has been imported. However, you should not be able to make a search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now and you should be able to find the record.

Signed-off-by: Magnus Enger
Followed the test plan. Works as advertised.
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek changed:

  Attachment #176698 is obsolete: 0 → 1

--- Comment #3 from Janusz Kaczmarek ---
Created attachment 176701
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176701&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized records

After Bug 38416, Elasticsearch indexing explodes with oversized records, especially with UTF-8 encoded data. In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the following snippet was introduced:

  my $usmarc_record = $record->as_usmarc();
  my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a MARC::Record object, but not for $record->as_usmarc. The produced ISO 2709 string is not correct and hence cannot be properly converted back to a MARC::Record object by new_from_usmarc. The result in this case can be like:

  UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.

Since this is done without any eval / try, the whole reindex procedure (for instance rebuild_elasticsearch.pl) gets randomly interrupted with no explanation.

Test plan:
==========
Hard to reproduce, but the explanation, together with the discussion in Bug 38416 (from 2024-12-15), explains and justifies the need for this added eval.
1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with:
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience). During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface with http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439 that the record has been imported. However, you should not be able to make a search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now and you should be able to find the record.
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #2 from Janusz Kaczmarek ---
Created attachment 176699
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176699&action=edit
Test record

A test MARCXML record with lots of items, producing an oversized ISO 2709 record.
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek changed:

  CC: added dc...@prosentient.com.au, m.de.r...@rijksmuseum.nl, martin.renvoize@ptfs-europe.com, n...@bywatersolutions.com, nug...@gmail.com
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek changed:

  Assignee: koha-b...@lists.koha-community.org → janus...@gmail.com
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #1 from Janusz Kaczmarek ---
Created attachment 176698
--> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176698&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized records

After Bug 38416, Elasticsearch indexing explodes with oversized records, especially with UTF-8 encoded data. In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the following snippet was introduced:

  my $usmarc_record = $record->as_usmarc();
  my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a MARC::Record object, but not for $record->as_usmarc. The produced ISO 2709 string is not correct and hence cannot be properly converted back to a MARC::Record object by new_from_usmarc. The result in this case can be like:

  UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.

Since this is done without any eval / try, the whole reindex procedure (for instance rebuild_elasticsearch.pl) gets randomly interrupted with no explanation.

Test plan:
==========
Hard to reproduce, but the explanation, together with the discussion in Bug 38416 (from 2024-12-15), explains and justifies the need for this added eval.
[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek changed:

  Patch complexity: --- → Trivial patch
  Status: NEW → Needs Signoff