[jira] [Comment Edited] (SOLR-16589) Large fields with large="true" can be truncated when using unicode values

Kevin Risden (Jira) Fri, 16 Dec 2022 05:03:05 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648617#comment-17648617
 ]


Kevin Risden edited comment on SOLR-16589 at 12/16/22 1:02 PM:
---------------------------------------------------------------

Well another reproducing seed:

{code:java}
./gradlew test --tests LargeFieldTest.test -Dtests.seed=A5EC77E6B81D86A2 \
  -Dtests.multiplier=3 -Dtests.locale=jgo-CM -Dtests.timezone=Asia/Ashgabat \
  -Dtests.asserts=true -Dtests.file.encoding=UTF-8

...

  2> 2106 INFO  (TEST-LargeFieldTest.test-seed#[A5EC77E6B81D86A2]) [] o.a.s.SolrTestCaseJ4 ###Ending test
   >     org.junit.ComparisonFailure: expected:<[HXwf^Z-]DZSy> but was:<[#24;HXwf^Z-#1;]DZSy>
   >         at __randomizedtesting.SeedInfo.seed([A5EC77E6B81D86A2:2DB8483C16E1EB5A]:0)
   >         at org.junit.Assert.assertEquals(Assert.java:117)
   >         at org.junit.Assert.assertEquals(Assert.java:146)
   >         at org.apache.solr.search.LargeFieldTest.test(LargeFieldTest.java:113)
{code}

The string starts with a CAN character (U+0018, decimal 24; see 
https://www.compart.com/en/unicode/U+0018), which is encoded as #24;, and 
also contains a SOH character (U+0001; see 
https://www.compart.com/en/unicode/U+0001), which is encoded as #1;.

I haven't found WHY this is happening yet. It might be related to the standard 
analyzer used for the field type - that we don't get EXACTLY back what we put 
into the field.
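
To make the failure message easier to read, the #N; notation can be reproduced in isolation. This is only a sketch: the helper `showControlChars` and the class name are hypothetical, the input string is an assumption reconstructed from the failure message above, and nothing here claims to be the code path Solr actually takes.

```java
public class ControlCharNotation {

    // Hypothetical helper: renders every character below U+0020 as
    // "#<decimal code point>;" and copies all other characters unchanged.
    static String showControlChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20) {
                sb.append('#').append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: CAN (U+0018) at the start, SOH (U+0001) after the '-'.
        String input = "\u0018HXwf^Z-\u0001DZSy";
        System.out.println(showControlChars(input)); // prints #24;HXwf^Z-#1;DZSy
    }
}
```

The decimal values match the notation in the log: U+0018 is 24 and U+0001 is 1.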


was (Author: risdenk):
Well another reproducing seed:

{code:java}
./gradlew test --tests LargeFieldTest.test -Dtests.seed=A5EC77E6B81D86A2 \
  -Dtests.multiplier=3 -Dtests.locale=jgo-CM -Dtests.timezone=Asia/Ashgabat \
  -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{code}


> Large fields with large="true" can be truncated when using unicode values
> -------------------------------------------------------------------------
>
>                 Key: SOLR-16589
>                 URL: https://issues.apache.org/jira/browse/SOLR-16589
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>    Affects Versions: 9.0, 9.1
>            Reporter: Nikolas Osvalds
>            Assignee: Kevin Risden
>            Priority: Major
>             Fix For: main (10.0), 9.2, 9.1.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> h3. Summary
> For fields defined with large="true", large values (the case this option is 
> intended for) can be truncated in Solr 9 and later.
> Example fieldtype definition:
> {code:java}
> <fieldtype name="string_large" class="solr.TextField" multiValued="false" indexed="false" stored="true" omitNorms="true" large="true" />
> {code}
> h3. Cause
> Looks like this is a bug introduced along with 
> https://issues.apache.org/jira/browse/LUCENE-8805 / 
> https://github.com/apache/lucene/issues/9849
> The current code is here:
> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L511
>  
> {code:java}
> public void stringField(FieldInfo fieldInfo, String value) throws IOException {
>     Objects.requireNonNull(value, "String value should not be null");
>     bytesRef.bytes = value.getBytes(StandardCharsets.UTF_8);
>     bytesRef.length = value.length();
> {code}
>  
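> The problem is specific to the handling of "large" fields.
> `value.length()` counts UTF-16 code units, while `bytesRef.bytes` holds the 
> UTF-8 encoding, which is longer whenever the value contains non-ASCII 
> characters. Setting `bytesRef.length` to the smaller number cuts off the end 
> of the value, hence the truncation.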
> h3. Fix
> {code:java}
> bytesRef.length = bytesRef.bytes.length;
> {code}
>  

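The length mismatch described in the issue can be reproduced outside Solr. The following is a minimal sketch, assuming a consumer that reads only the first `length` bytes of the array. The class name `Utf8LengthDemo`, the helper `decodePrefix`, and the sample string are illustrative and not taken from the Solr code base.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {

    // Decodes only the first `length` bytes, mimicking a reader that trusts
    // a separately stored byte length (hypothetical stand-in for that reader).
    static String decodePrefix(byte[] bytes, int length) {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "héllo wörld": 11 UTF-16 code units, two of them non-ASCII.
        String value = "h\u00e9llo w\u00f6rld";
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);

        System.out.println(value.length()); // 11 (UTF-16 code units)
        System.out.println(bytes.length);   // 13 (UTF-8 bytes)

        // Using the code-unit count as the byte length drops the last two bytes.
        String truncated = decodePrefix(bytes, value.length());
        System.out.println(truncated.equals("h\u00e9llo w\u00f6r")); // true

        // Using the byte array's own length round-trips the full value.
        String full = decodePrefix(bytes, bytes.length);
        System.out.println(full.equals(value)); // true
    }
}
```

For pure ASCII input the two lengths are equal, which would explain why the truncation only shows up with non-ASCII values.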


