[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #11 from David Cook  ---
Created attachment 176768
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176768&action=edit
Bug 38913: (QA follow-up) test UTF-8 exceptions in large MARC records

MARC records over 99999 bytes are invalid per the spec, and when you use
UTF-8 encoded characters in your MARC records, there is the potential
to generate fatal errors in MARC::File::USMARC when it runs
"marc_to_utf8" from "MARC::File::Encode" during its "decode" operation.

That is, if you MARC::File::USMARC->encode a MARC record
of over 99999 bytes (including a number of multi-byte UTF-8 characters),
there is the potential that running MARC::File::USMARC->decode on that
same data will generate a fatal exception.

The main patch in bug 38913 wraps the function doing the decode,
so that a bad record doesn't crash processing.

Without the patch, this unit test will fail. With the patch, this
unit test will pass.
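
For illustration, the shape of such a test can be sketched like this (a
minimal standalone sketch; the record construction and test names here are
assumptions, not the actual attachment):

use Modern::Perl;
use Test::More tests => 1;
use MARC::Record;
use MARC::Field;
# No 'use utf8' on purpose: the literals stay raw UTF-8 bytes, since MARC
# blobs are byte strings and the 99999 limit is on encoded bytes.

# Build a record well past the 99999-byte ISO 2709 limit, padded with
# multi-byte UTF-8 so that a miscomputed slice can split a character.
my $record = MARC::Record->new();
$record->append_fields(
    MARC::Field->new( '505', ' ', ' ', a => 'Zażółć gęślą jaźń 中文 ' x 200 )
) for 1 .. 30;

# as_usmarc() warns that the record is oversized but still returns a blob.
my $blob = $record->as_usmarc();

# Unguarded, new_from_usmarc() can die here with
# 'UTF-8 "\x.." does not map to Unicode'; the eval contains it.
my $decoded = eval { MARC::Record->new_from_usmarc($blob) };
diag("decode raised: $@") if $@;
ok( 1, 'oversized UTF-8 decode did not abort the test run' );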



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #10 from David Cook  ---
(In reply to David Cook from comment #9)
> Let's see if I can break the unit test in main...

I added a bunch of Chinese and some of the Polish from the test record, and I
couldn't get the unit tests to break in main. 

I was about to give up... when I tried again, and I managed to get the
following:

kohadev-koha@kohadevbox:koha(main)$ prove
t/db_dependent/Koha/SearchEngine/Elasticsearch.t
t/db_dependent/Koha/SearchEngine/Elasticsearch.t .. 1/8 # Looks like you
planned 70 tests but ran 55.

#   Failed test 'Koha::SearchEngine::Elasticsearch::marc_records_to_documents
() tests'
#   at t/db_dependent/Koha/SearchEngine/Elasticsearch.t line 805.
UTF-8 "\x99" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.
# Looks like your test exited with 11 just after 4.

Hurray!
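
For anyone trying to reproduce this, the kind of padding involved can be
sketched roughly like so (a hypothetical reconstruction, not the exact data I
used):

use MARC::Record;
use MARC::Field;
# No 'use utf8': keep the literals as raw UTF-8 bytes, like real MARC data.

# Pad a record well past 99999 bytes with Chinese and Polish text, so the
# invalid USMARC directory can make decode slice mid-character.
my $record = MARC::Record->new();
$record->append_fields(
    MARC::Field->new( '500', ' ', ' ', a => '中文測試 Zażółć gęślą jaźń ' x 500 )
) for 1 .. 20;
printf "encoded: %d bytes\n", length $record->as_usmarc();

# Whether decode actually dies depends on which bytes land at the
# miscomputed boundaries -- hence it took a few attempts to hit it.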



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #9 from David Cook  ---
Something is still bugging me...

t/db_dependent/Koha/SearchEngine/Elasticsearch.t

In theory, that test script has a test for a "large MARC record" which runs
marc_records_to_documents(), and it doesn't fail on "main".

It must be longer than 99999 bytes, since it switches to MARCXML from
base64ISO2709.

So we did test for large MARC records on bug 38416.

But... since the fatal error Janusz is fixing comes from MARC::File::USMARC not
handling an exception during marc_to_utf8(), it must also be that being over
99999 bytes creates an invalid USMARC directory in the USMARC data. It then
starts doing string handling using the positional math, and then it needs to
hit the right combination of invalid bytes.

Since the unit test was just using single-byte ASCII... it should be
impossible to generate invalid UTF-8 in this particular test scenario.
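
To make the positional math concrete: the first five bytes of the leader hold
the record length, and five digits cannot represent more than 99999. A
hypothetical check (variable names illustrative, not code from the patch):

# Given a MARC::Record in $record whose encoded size exceeds 99999 bytes:
my $blob    = $record->as_usmarc();
my $claimed = substr $blob, 0, 5;    # leader bytes 0-4: 5-digit record length
printf "leader claims %s bytes; blob is actually %d bytes\n",
    $claimed, length $blob;
# new_from_usmarc() trusts those digits (and the 4-digit field lengths and
# 5-digit offsets in the directory) for its substr() math, so a slice can
# start partway through a multi-byte UTF-8 character -- and only then does
# marc_to_utf8() die with "does not map to Unicode".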

Let's see if I can break the unit test in main...



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook  changed:

   What|Removed |Added

 Status|Signed Off  |Passed QA



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook  changed:

   What|Removed |Added

 Attachment #176704|0   |1
is obsolete||

--- Comment #8 from David Cook  ---
Created attachment 176767
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176767&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized
records

After Bug 38416, Elasticsearch indexing explodes with oversized
records, especially with UTF-8 encoded data.

In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the
following snippet was introduced:

my $usmarc_record = $record->as_usmarc();
my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a
MARC::Record object, but not for $record->as_usmarc. The produced
ISO 2709 string is not correct and hence cannot be properly converted
back to a MARC::Record object by new_from_usmarc.

The result in this case can be like:

UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.

Since this is done without any eval / try, the whole reindex procedure
(for instance rebuild_elasticsearch.pl) gets randomly interrupted
with no explanation.
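
For reference, the added guard is essentially of this shape (a minimal
sketch; the actual patch may differ in naming and warning text):

my $usmarc_record = $record->as_usmarc();
my $decoded_usmarc_record = eval { MARC::Record->new_from_usmarc($usmarc_record) };
if ( $@ || !$decoded_usmarc_record ) {
    # Skip the bad roundtrip instead of letting the whole reindex die.
    warn "Could not roundtrip record through USMARC: $@";
}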

Test plan:
==
Hard to reproduce, but the explanation, together with the discussion in
Bug 38416 (from 2024-12-15), explains and justifies the need for this
added eval.

1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience).
   During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface at
   http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439
   that the record has been imported.
   However, you should not be able to search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now, and you should be able to find the record.

Signed-off-by: Magnus Enger 
Followed the test plan. Works as advertised.
Signed-off-by: David Cook 



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

David Cook  changed:

   What|Removed |Added

   Severity|major   |blocker
 QA Contact|testo...@bugs.koha-community.org|dc...@prosentient.com.au

--- Comment #7 from David Cook  ---
While my ktd was downloading, I looked at the code and warnings more, and I see
the logic here now. I must've been so focused on bug 38416 on fields with more
than 9999 bytes that we forgot to test records with more than 99999 bytes. The
MARC::* modules have some interesting quirks. MARC::File::USMARC could use some
attention...

--

So this patch is good. It fixes the problem. Thanks a lot, Janusz, for
following up on this one.

I'll mark it QAed and I'll increase the importance.



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #6 from David Cook  ---
I'm guessing I need to update my koha-testing-docker, because it's failing with
a missing MARC/Lint.pm module...



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-19 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #5 from David Cook  ---
Without the patch, I tried to use the web UI to import this file, but it failed
to load. In /var/log/koha/kohadev/worker-output.log I see the following:

Record length of 527856 is larger than the MARC spec allows (99999 bytes). at
/usr/share/perl5/MARC/File/USMARC.pm line 314.
UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.

--

I applied the patch, ran 'restart_all', and tried the web UI again.

And it's still failing to import.

--

I'll try the test plan you've provided...



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Magnus Enger  changed:

   What|Removed |Added

 Status|Needs Signoff   |Signed Off



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Magnus Enger  changed:

   What|Removed |Added

 Attachment #176701|0   |1
is obsolete||

--- Comment #4 from Magnus Enger  ---
Created attachment 176704
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176704&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized
records

After Bug 38416, Elasticsearch indexing explodes with oversized
records, especially with UTF-8 encoded data.

In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the
following snippet was introduced:

my $usmarc_record = $record->as_usmarc();
my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a
MARC::Record object, but not for $record->as_usmarc. The produced
ISO 2709 string is not correct and hence cannot be properly converted
back to a MARC::Record object by new_from_usmarc.

The result in this case can be like:

UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.

Since this is done without any eval / try, the whole reindex procedure
(for instance rebuild_elasticsearch.pl) gets randomly interrupted
with no explanation.

Test plan:
==
Hard to reproduce, but the explanation, together with the discussion in
Bug 38416 (from 2024-12-15), explains and justifies the need for this
added eval.

1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience).
   During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface at
   http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439
   that the record has been imported.
   However, you should not be able to search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now, and you should be able to find the record.

Signed-off-by: Magnus Enger 
Followed the test plan. Works as advertised.



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek  changed:

   What|Removed |Added

 Attachment #176698|0   |1
is obsolete||

--- Comment #3 from Janusz Kaczmarek  ---
Created attachment 176701
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176701&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized
records

After Bug 38416, Elasticsearch indexing explodes with oversized
records, especially with UTF-8 encoded data.

In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the
following snippet was introduced:

my $usmarc_record = $record->as_usmarc();
my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a
MARC::Record object, but not for $record->as_usmarc. The produced
ISO 2709 string is not correct and hence cannot be properly converted
back to a MARC::Record object by new_from_usmarc.

The result in this case can be like:

UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.

Since this is done without any eval / try, the whole reindex procedure
(for instance rebuild_elasticsearch.pl) gets randomly interrupted
with no explanation.

Test plan:
==
Hard to reproduce, but the explanation, together with the discussion in
Bug 38416 (from 2024-12-15), explains and justifies the need for this
added eval.

1. Have a standard KTD installation with Elasticsearch.
2. Use the provided test record - add it to Koha with
   ./misc/migration_tools/bulkmarcimport.pl -b -file test.xml -m=MARCXML
   (have patience).
   During the load process you should see a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
3. The record should get biblionumber 439. Check in the librarian interface at
   http://:8081/cgi-bin/koha/catalogue/detail.pl?biblionumber=439
   that the record has been imported.
   However, you should not be able to search for this record.
4. Try to reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   You should get a message like:
   UTF-8 "\xC4" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35.
   Again, no search results.
5. Apply the patch; restart_all.
6. Repeat the reindex with:
   ./misc/search_tools/rebuild_elasticsearch.pl -b -bn 439
   There should be no warning now, and you should be able to find the record.



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #2 from Janusz Kaczmarek  ---
Created attachment 176699
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176699&action=edit
Test record

A test MARCXML record with lots of items, producing an oversized ISO 2709 record.



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek  changed:

   What|Removed |Added

 CC||dc...@prosentient.com.au,
   ||m.de.r...@rijksmuseum.nl,
   ||martin.renvoize@ptfs-europe.com,
   ||n...@bywatersolutions.com,
   ||nug...@gmail.com



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek  changed:

   What|Removed |Added

   Assignee|koha-b...@lists.koha-community.org |janus...@gmail.com



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

--- Comment #1 from Janusz Kaczmarek  ---
Created attachment 176698
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=176698&action=edit
Bug 38913: (bug 38416 follow-up) Elasticsearch indexing explodes with oversized
records

After Bug 38416, Elasticsearch indexing explodes with oversized
records, especially with UTF-8 encoded data.

In Koha::SearchEngine::Elasticsearch::marc_records_to_documents the
following snippet was introduced:

my $usmarc_record = $record->as_usmarc();
my $decoded_usmarc_record = MARC::Record->new_from_usmarc($usmarc_record);

But if $record is oversized (> 99999 bytes), it is still fine as a
MARC::Record object, but not for $record->as_usmarc. The produced
ISO 2709 string is not correct and hence cannot be properly converted
back to a MARC::Record object by new_from_usmarc.

The result in this case can be like:

UTF-8 "\x85" does not map to Unicode at /usr/share/perl5/MARC/File/Encode.pm
line 35.

Since this is done without any eval / try, the whole reindex procedure
(for instance rebuild_elasticsearch.pl) gets randomly interrupted
with no explanation.

Test plan:
==
Hard to reproduce, but the explanation, together with the discussion in
Bug 38416 (from 2024-12-15), explains and justifies the need for this
added eval.



[Koha-bugs] [Bug 38913] Elasticsearch indexing explodes with oversized records

2025-01-16 Thread bugzilla-daemon--- via Koha-bugs
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38913

Janusz Kaczmarek  changed:

   What|Removed |Added

   Patch complexity|--- |Trivial patch
 Status|NEW |Needs Signoff
