flowpoint created ARROW-17943:
---------------------------------

             Summary: core dumped when joining big large_strings
                 Key: ARROW-17943
                 URL: https://issues.apache.org/jira/browse/ARROW-17943
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 9.0.0
         Environment: run inside a fedora container: registry.fedoraproject.org/fedora-toolbox:36

host information:

uname -a:
Linux ws1 5.18.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 3 15:44:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

/etc/os-release:
NAME="Fedora Linux"
VERSION="36 (Container Image)"
ID=fedora
VERSION_ID=36
VERSION_CODENAME=""
PLATFORM_ID="platform:f36"
PRETTY_NAME="Fedora Linux 36 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:36"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=36
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=36
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Container Image"
VARIANT_ID=container
            Reporter: flowpoint


Joining two tables that contain large strings in pyarrow aborts with this error:
{noformat}
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)
{noformat}
Example code:

Note that this needs a lot of RAM (it was run on a machine with 128GB).
{code:python}
import pyarrow as pa

# 2**24 rows, each carrying a 1 KiB string (about 16 GiB of text per table)
ids = list(range(2**24))
text = ['a'*2**10]*2**24

schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.large_string()),
])

tab1 = pa.Table.from_arrays([ids, text], schema=schema)
tab2 = pa.Table.from_arrays([ids, text], schema=schema)

# The process aborts during this join
joined = tab1.join(tab2, keys='Id', right_keys='Id', left_suffix='tab1')
{code}
The same code segfaults instead of aborting if I use this schema:
{code:python}
schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.string()),
])
{code}
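For a sense of scale, here is a rough sketch of how much character data the `Text` column in the example holds, compared with what 32-bit offsets can address. It relies only on the row count and string length from the example and on the fact that Arrow's `string` type uses signed 32-bit offsets while `large_string` uses 64-bit offsets. The names `ROWS`, `BYTES_PER_VALUE`, `INT32_OFFSET_LIMIT` and `text_column_bytes` are made up for illustration, and this is not a diagnosis of the crash itself.

```python
# Back-of-the-envelope size of the Text column from the example above.
ROWS = 2**24             # number of rows in each table
BYTES_PER_VALUE = 2**10  # each value is 'a' * 2**10; ASCII is one byte per character in UTF-8

# Signed 32-bit offsets (as used by pa.string()) can address at most
# 2**31 - 1 bytes of character data in one contiguous array.
INT32_OFFSET_LIMIT = 2**31 - 1


def text_column_bytes(rows: int, bytes_per_value: int) -> int:
    """Total bytes of character data, ignoring offset and validity buffers."""
    return rows * bytes_per_value


total = text_column_bytes(ROWS, BYTES_PER_VALUE)
# Ceiling division: how many contiguous 32-bit-offset arrays the data would need.
min_chunks = -(-total // INT32_OFFSET_LIMIT)

print(f"Text column payload: {total / 2**30:.0f} GiB")
print(f"Exceeds 32-bit offset range: {total > INT32_OFFSET_LIMIT}")
print(f"Minimum number of 32-bit-offset chunks: {min_chunks}")
```

Each table therefore carries far more string data than a single `pa.string()` array can address, so the `string` variant of the example can only hold the column in multiple chunks, while `large_string` can hold it in a single array.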