[ https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-17740:
--------------------------------
    Attachment: test_join.cpp

> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
>                 Key: ARROW-17740
>                 URL: https://issues.apache.org/jira/browse/ARROW-17740
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: LinGeLin
>            Priority: Major
>         Attachments: data.zip, join_test.zip, test.cpp, test_join.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In the scenario I'm working on, I read one Parquet file and then several
> other Parquet files. All of these files share a column with the same name
> (UUID). I need to join (by UUID), project (remove UUID), and filter (some
> custom filtering) the results of the two reads. As far as I can tell, Acero
> is the only way to do a join, but when I tested it, its performance was very
> poor and it was very unstable, with frequent core dumps. Is there another way
> to do this, or at least another way to do the join?
>
> my project commit:
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl --force-reinstall --no-deps
>
> run v4test.py to test the dataset
>
> data.zip contains several Parquet files, which are stored on S3 in my
> scenario.
> I have copied some of the code into test.cpp. It only shows the general
> flow and has not been compiled.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)