flowpoint created ARROW-17943:
---------------------------------

             Summary: core dumped when joining big large_strings
                 Key: ARROW-17943
                 URL: https://issues.apache.org/jira/browse/ARROW-17943
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 9.0.0
         Environment: run inside a fedora container:
registry.fedoraproject.org/fedora-toolbox:36

host information:
uname -a:

Linux ws1 5.18.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 3 15:44:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

/etc/os-release:

NAME="Fedora Linux"
VERSION="36 (Container Image)"
ID=fedora
VERSION_ID=36
VERSION_CODENAME=""
PLATFORM_ID="platform:f36"
PRETTY_NAME="Fedora Linux 36 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:36"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=36
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=36
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Container Image"
VARIANT_ID=container
            Reporter: flowpoint


Joining two tables that contain large strings in pyarrow aborts with this error:


{code:java}
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped) {code}

Example code:
Note that this needs a lot of RAM (it was run on a machine with 128 GB).
{code:java}
import pyarrow as pa

# 2**24 rows, each holding a 1 KiB string: about 16 GiB of text per table
ids = list(range(2**24))
text = ['a' * 2**10] * 2**24
schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.large_string()),
    ])

tab1 = pa.Table.from_arrays([ids, text], schema=schema)
tab2 = pa.Table.from_arrays([ids, text], schema=schema)

# the join aborts with the std::length_error shown above
joined = tab1.join(tab2, keys='Id', right_keys='Id', left_suffix='tab1'){code}

The same code results in a segfault instead if I use this schema:
{code:java}
schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.string()),
    ]){code}
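
For scale, here is a rough sketch of how much string data the reproducer puts into the 'Text' column, in plain Python with no pyarrow needed. The helper names are made up for illustration. It only sizes the input and compares it with the 32-bit offset range that pa.string() uses (pa.large_string() uses 64-bit offsets); it does not claim to show where in the join the crash actually happens.

```python
# Size of the 'Text' column in the reproducer versus the range of a
# signed 32-bit offset. Helper names are hypothetical, for illustration only.

ROWS = 2**24             # number of rows in each table
BYTES_PER_VALUE = 2**10  # each value is 'a' * 2**10, i.e. 1 KiB of ASCII
INT32_MAX = 2**31 - 1    # largest value a signed 32-bit offset can hold


def text_column_bytes(rows: int = ROWS, bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Total bytes of string data in one 'Text' column."""
    return rows * bytes_per_value


def fits_in_int32_offsets(total_bytes: int) -> bool:
    """True if a single contiguous buffer of this size is addressable with int32 offsets."""
    return total_bytes <= INT32_MAX


total = text_column_bytes()
print(f"Text column: {total} bytes ({total // 2**30} GiB)")
print("fits in 32-bit offsets:", fits_in_int32_offsets(total))
```

Each table carries 16 GiB of text, which is 8 times more than a single buffer addressed by 32-bit offsets can hold.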

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
