[jira] [Resolved] (ARROW-11563) [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))
[ https://issues.apache.org/jira/browse/ARROW-11563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11563. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9449 [https://github.com/apache/arrow/pull/9449] > [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None)) > --- > > Key: ARROW-11563 > URL: https://issues.apache.org/jira/browse/ARROW-11563 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Patsura Dmitry >Assignee: Patsura Dmitry >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11557) [Rust] Add table de-registration to DataFusion ExecutionContext
[ https://issues.apache.org/jira/browse/ARROW-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-11557: Component/s: Rust > [Rust] Add table de-registration to DataFusion ExecutionContext > --- > > Key: ARROW-11557 > URL: https://issues.apache.org/jira/browse/ARROW-11557 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Table de-registration, as discussed at > https://lists.apache.org/thread.html/r0b3bc62a720c204c5bbe26d8157963276f7d61c05fcbad7eaf2ae9ff%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11557) [Rust] Add table de-registration to DataFusion ExecutionContext
[ https://issues.apache.org/jira/browse/ARROW-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11557. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9445 [https://github.com/apache/arrow/pull/9445] > [Rust] Add table de-registration to DataFusion ExecutionContext > --- > > Key: ARROW-11557 > URL: https://issues.apache.org/jira/browse/ARROW-11557 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Table de-registration, as discussed at > https://lists.apache.org/thread.html/r0b3bc62a720c204c5bbe26d8157963276f7d61c05fcbad7eaf2ae9ff%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11607) [Python] Error when reading table with list values from parquet
Michal Glaus created ARROW-11607: Summary: [Python] Error when reading table with list values from parquet Key: ARROW-11607 URL: https://issues.apache.org/jira/browse/ARROW-11607 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0, 2.0.0, 1.0.1, 1.0.0 Environment: Python 3.7 Reporter: Michal Glaus I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file. Example code (pyarrow 2.0.0 and 3.0.0):
{code:python}
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])
table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])
parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
print('Actual: %s' % table2['column'][-1])
{code}
Output:
{noformat}
Expected: [1]
Actual: [0]
{noformat}
When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
{noformat}
Expected: [1]
Actual: [1]
{noformat}
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15. It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0 with more complex values (list of dictionaries with float and datetime):
{noformat}
data.append([{'a': 0.1, 'b': datetime.now()}])
{noformat}
I'm getting this exception after calling table2.to_pandas():
{noformat}
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)

[jira] [Resolved] (ARROW-11539) [Developer][Archery] Change items_per_seconds units
[ https://issues.apache.org/jira/browse/ARROW-11539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-11539. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9433 [https://github.com/apache/arrow/pull/9433] > [Developer][Archery] Change items_per_seconds units > --- > > Key: ARROW-11539 > URL: https://issues.apache.org/jira/browse/ARROW-11539 > Project: Apache Arrow > Issue Type: Improvement > Components: Archery, Developer Tools >Reporter: Diana Clarke >Assignee: Diana Clarke >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Antoine requested that I change the units in {{items_per_seconds_fmt}} to be: > - K items/sec > - M items/sec > - G items/sec > Rather than: > - k items/sec > - m items/sec > - b items/sec -- This message was sent by Atlassian Jira (v8.3.4#803005)
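The requested convention can be sketched as a small formatter (a hypothetical illustration of the K/M/G naming only; this is not Archery's actual {{items_per_seconds_fmt}} implementation):

```python
def items_per_second_fmt(value: float) -> str:
    # Scale into the largest fitting unit, using upper-case K/M/G
    # (as requested) rather than the old k/m/b suffixes.
    for factor, unit in [(1e9, "G"), (1e6, "M"), (1e3, "K")]:
        if value >= factor:
            return f"{value / factor:.3f} {unit} items/sec"
    return f"{value:.3f} items/sec"

print(items_per_second_fmt(2_500_000))  # 2.500 M items/sec
```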
[jira] [Created] (ARROW-11608) [CI] turbodbc integration tests are failing (build issue)
Joris Van den Bossche created ARROW-11608: - Summary: [CI] turbodbc integration tests are failing (build issue) Key: ARROW-11608 URL: https://issues.apache.org/jira/browse/ARROW-11608 Project: Apache Arrow Issue Type: Improvement Components: CI Reporter: Joris Van den Bossche Both turbodbc builds are failing, see e.g. https://github.com/ursacomputing/crossbow/runs/1885201762 It seems to be a failure to build turbodbc:
{code}
/build/turbodbc /
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/conda/envs/arrow/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: Debug
CMake Error at CMakeLists.txt:14 (add_subdirectory):
  add_subdirectory given source "pybind11" which is not an existing directory.
-- Found GTest: /opt/conda/envs/arrow/lib/libgtest.so
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found components: locale
-- Detecting unixODBC library
-- Found header files at: /opt/conda/envs/arrow/include
-- Found library at: /opt/conda/envs/arrow/lib/libodbc.so
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found components: system date_time locale
-- Detecting unixODBC library
-- Found header files at: /opt/conda/envs/arrow/include
-- Found library at: /opt/conda/envs/arrow/lib/libodbc.so
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found components: system
-- Detecting unixODBC library
-- Found header files at: /opt/conda/envs/arrow/include
-- Found library at: /opt/conda/envs/arrow/lib/libodbc.so
CMake Error at cpp/turbodbc_python/Library/CMakeLists.txt:3 (pybind11_add_module):
  Unknown CMake command "pybind11_add_module".
-- Configuring incomplete, errors occurred!
See also "/build/turbodbc/CMakeFiles/CMakeOutput.log".
See also "/build/turbodbc/CMakeFiles/CMakeError.log".
1 Error: `docker-compose --file /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e SETUPTOOLS_SCM_PRETEND_VERSION=3.1.0.dev174 conda-python-turbodbc` exited with a non-zero exit code 1, see the process log above.
{code}
cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics
[ https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11586: --- Labels: pull-request-available (was: ) > [Rust] [Datafusion] Invalid SQL sometimes panics > > > Key: ARROW-11586 > URL: https://issues.apache.org/jira/browse/ARROW-11586 > Project: Apache Arrow > Issue Type: Bug >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Executing the invalid SQL "select 1 order by x" will panic rather than returning > an Err: > ``` > thread '' panicked at 'called `Result::unwrap()` on an `Err` value: > Plan("Invalid identifier \'x\' for schema Int64(1)")', > /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76 > stack backtrace: >0: _rust_begin_unwind >1: core::panicking::panic_fmt >2: core::option::expect_none_failed >3: core::result::Result::unwrap >4: datafusion::sql::planner::SqlToRel::order_by::{{closure}} >5: core::iter::adapters::map_try_fold::{{closure}} >6: core::iter::traits::iterator::Iterator::try_fold >7: as > core::iter::traits::iterator::Iterator>::try_fold >8: as > core::iter::traits::iterator::Iterator>::try_fold >9: core::iter::traits::iterator::Iterator::find > 10: as > core::iter::traits::iterator::Iterator>::next > 11: as alloc::vec::SpecFromIterNested>::from_iter > 12: as alloc::vec::SpecFromIter>::from_iter > 13: as > core::iter::traits::collect::FromIterator>::from_iter > 14: core::iter::traits::iterator::Iterator::collect > 15: as > core::iter::traits::collect::FromIterator>>::from_iter::{{closure}} > 16: core::iter::adapters::process_results > 17: as > core::iter::traits::collect::FromIterator>>::from_iter > 18: core::iter::traits::iterator::Iterator::collect > 19: datafusion::sql::planner::SqlToRel::order_by > 20: datafusion::sql::planner::SqlToRel::query_to_plan > 21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan > 22: datafusion::sql::planner::SqlToRel::statement_to_plan > 23: datafusion::execution::context::ExecutionContext::create_logical_plan > ``` > This is happening because of an `unwrap` at > https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652. > > Perhaps the error should be returned as the Result rather than panicking, so > the error can be handled? There are a number of other places in the planner > where `unwrap()` is used, so they may warrant similar treatment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283697#comment-17283697 ] Joris Van den Bossche commented on ARROW-11456: --- bq. Note that you may be able to do the conversion manually and force a Arrow large_string type, though I'm not sure Pandas allows that. Yes, pandas allows that by specifying a pyarrow schema manually (instead of letting pyarrow infer that from the dataframe). For the example above, that would look like: {code} df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, schema=pa.schema([("s", pa.large_string())])) {code} > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.1.5 / 1.2.1 > smart_open 4.1.2 > python 3.8.6 >Reporter: Pac A. He >Priority: Major > > When reading or writing a large parquet file, I have this error: > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for 
more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 2.5 billion unique > rows. Each value was effectively a unique base64 encoded length 24 string. > Below is code to reproduce the issue: > {code:python} > from base64 import urlsafe_b64encode > import numpy as np > import pandas as pd > import pyarrow as pa > import smart_open > def num_to_b64(num: int) -> str: > return urlsafe_b64encode(num.to_bytes(16, "little")).decode() > df = > pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s") > with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file: > df.to_parquet(output_file, engine="pyarrow", compression="gzip", > index=False) > {code} > The dataframe is created correctly. When attempting to write it as a parquet > file, the last line of the above code leads to the error: > {noformat} > pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more > than 2147483646 child elements, got 25 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283697#comment-17283697 ] Joris Van den Bossche edited comment on ARROW-11456 at 2/12/21, 1:44 PM: - bq. Note that you may be able to do the conversion manually and force a Arrow large_string type, though I'm not sure Pandas allows that. Yes, pandas allows that by specifying a pyarrow schema manually (instead of letting pyarrow infer that from the dataframe). For the example above, that would look like: {code} df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, schema=pa.schema([("s", pa.large_string())])) {code} [~apacman] does that help as a work-around? was (Author: jorisvandenbossche): bq. Note that you may be able to do the conversion manually and force a Arrow large_string type, though I'm not sure Pandas allows that. Yes, pandas allows that by specifying a pyarrow schema manually (instead of letting pyarrow infer that from the dataframe). For the example above, that would look like: {code} df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, schema=pa.schema([("s", pa.large_string())])) {code} > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.1.5 / 1.2.1 > smart_open 4.1.2 > python 3.8.6 >Reporter: Pac A. 
He >Priority: Major > > When reading or writing a large parquet file, I have this error: > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 2.5 billion unique > rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue: > {code:python} > from base64 import urlsafe_b64encode > import numpy as np > import pandas as pd > import pyarrow as pa > import smart_open > def num_to_b64(num: int) -> str: > return urlsafe_b64encode(num.to_bytes(16, "little")).decode() > df = > pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s") > with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file: > df.to_parquet(output_file, engine="pyarrow", compression="gzip", > index=False) > {code} > The dataframe is created correctly. When attempting to write it as a parquet > file, the last line of the above code leads to the error: > {noformat} > pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more > than 2147483646 child elements, got 25 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11596) [Python][Dataset] SIGSEGV when executing scan tasks with Python executors
[ https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-11596: - Summary: [Python][Dataset] SIGSEGV when executing scan tasks with Python executors (was: [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors) > [Python][Dataset] SIGSEGV when executing scan tasks with Python executors > - > > Key: ARROW-11596 > URL: https://issues.apache.org/jira/browse/ARROW-11596 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: dataset, datasets > > This crashes for me with a segfault: > {code:python} > import concurrent.futures > import queue > import numpy as np > import pyarrow as pa > import pyarrow.dataset as ds > import pyarrow.fs as fs > import pyarrow.parquet as pq > schema = pa.schema([("foo", pa.float64())]) > table = pa.table([np.random.uniform(size=1024)], schema=schema) > path = "/tmp/foo.parquet" > pq.write_table(table, path) > dataset = pa.dataset.FileSystemDataset.from_paths( > [path], > schema=schema, > format=ds.ParquetFileFormat(), > filesystem=fs.LocalFileSystem(), > ) > with concurrent.futures.ThreadPoolExecutor(2) as executor: > tasks = dataset.scan() > q = queue.Queue() > def _prebuffer(): > for task in tasks: > iterator = task.execute() > next(iterator) > q.put(iterator) > executor.submit(_prebuffer).result() > next(q.get()) > {code} > {noformat} > $ uname -a > Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + > x86_64 GNU/Linux > $ pip freeze > numpy==1.20.1 > pyarrow==3.0.0 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11596) [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors
[ https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-11596: - Component/s: (was: C++) > [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors > -- > > Key: ARROW-11596 > URL: https://issues.apache.org/jira/browse/ARROW-11596 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: dataset, datasets > > This crashes for me with a segfault: > {code:python} > import concurrent.futures > import queue > import numpy as np > import pyarrow as pa > import pyarrow.dataset as ds > import pyarrow.fs as fs > import pyarrow.parquet as pq > schema = pa.schema([("foo", pa.float64())]) > table = pa.table([np.random.uniform(size=1024)], schema=schema) > path = "/tmp/foo.parquet" > pq.write_table(table, path) > dataset = pa.dataset.FileSystemDataset.from_paths( > [path], > schema=schema, > format=ds.ParquetFileFormat(), > filesystem=fs.LocalFileSystem(), > ) > with concurrent.futures.ThreadPoolExecutor(2) as executor: > tasks = dataset.scan() > q = queue.Queue() > def _prebuffer(): > for task in tasks: > iterator = task.execute() > next(iterator) > q.put(iterator) > executor.submit(_prebuffer).result() > next(q.get()) > {code} > {noformat} > $ uname -a > Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + > x86_64 GNU/Linux > $ pip freeze > numpy==1.20.1 > pyarrow==3.0.0 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11596) [Python][Dataset] SIGSEGV when executing scan tasks with Python executors
[ https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11596: --- Labels: dataset datasets pull-request-available (was: dataset datasets) > [Python][Dataset] SIGSEGV when executing scan tasks with Python executors > - > > Key: ARROW-11596 > URL: https://issues.apache.org/jira/browse/ARROW-11596 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: dataset, datasets, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This crashes for me with a segfault: > {code:python} > import concurrent.futures > import queue > import numpy as np > import pyarrow as pa > import pyarrow.dataset as ds > import pyarrow.fs as fs > import pyarrow.parquet as pq > schema = pa.schema([("foo", pa.float64())]) > table = pa.table([np.random.uniform(size=1024)], schema=schema) > path = "/tmp/foo.parquet" > pq.write_table(table, path) > dataset = pa.dataset.FileSystemDataset.from_paths( > [path], > schema=schema, > format=ds.ParquetFileFormat(), > filesystem=fs.LocalFileSystem(), > ) > with concurrent.futures.ThreadPoolExecutor(2) as executor: > tasks = dataset.scan() > q = queue.Queue() > def _prebuffer(): > for task in tasks: > iterator = task.execute() > next(iterator) > q.put(iterator) > executor.submit(_prebuffer).result() > next(q.get()) > {code} > {noformat} > $ uname -a > Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + > x86_64 GNU/Linux > $ pip freeze > numpy==1.20.1 > pyarrow==3.0.0 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics
[ https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283728#comment-17283728 ] Marc Prud'hommeaux commented on ARROW-11586: Unless there is some specific reason to panic there, replacing the `.unwrap()` with `?` fixes the issue: https://github.com/apache/arrow/pull/9479/files. I wonder if the other `unwrap()` instances in that module could similarly be turned into Result? > [Rust] [Datafusion] Invalid SQL sometimes panics > > > Key: ARROW-11586 > URL: https://issues.apache.org/jira/browse/ARROW-11586 > Project: Apache Arrow > Issue Type: Bug >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Executing the invalid SQL "select 1 order by x" will panic rather returning > an Err: > ``` > thread '' panicked at 'called `Result::unwrap()` on an `Err` value: > Plan("Invalid identifier \'x\' for schema Int64(1)")', > /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76 > stack backtrace: >0: _rust_begin_unwind >1: core::panicking::panic_fmt >2: core::option::expect_none_failed >3: core::result::Result::unwrap >4: datafusion::sql::planner::SqlToRel::order_by::{{closure}} >5: core::iter::adapters::map_try_fold::{{closure}} >6: core::iter::traits::iterator::Iterator::try_fold >7: as > core::iter::traits::iterator::Iterator>::try_fold >8: as > core::iter::traits::iterator::Iterator>::try_fold >9: core::iter::traits::iterator::Iterator::find > 10: as > core::iter::traits::iterator::Iterator>::next > 11: as alloc::vec::SpecFromIterNested>::from_iter > 12: as alloc::vec::SpecFromIter>::from_iter > 13: as > core::iter::traits::collect::FromIterator>::from_iter > 14: core::iter::traits::iterator::Iterator::collect > 15: as > core::iter::traits::collect::FromIterator>>::from_iter::{{closure}} > 16: core::iter::adapters::process_results > 17: as > 
core::iter::traits::collect::FromIterator>>::from_iter > 18: core::iter::traits::iterator::Iterator::collect > 19: datafusion::sql::planner::SqlToRel::order_by > 20: datafusion::sql::planner::SqlToRel::query_to_plan > 21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan > 22: datafusion::sql::planner::SqlToRel::statement_to_plan > 23: datafusion::execution::context::ExecutionContext::create_logical_plan > ``` > This is happening because of an `unwrap` at > https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652. > > Perhaps the error should be returned as the Result rather than panicking, so > the error can be handled? There are a number of other places in the planner > where `unwrap()` is used, so they may warrant similar treatment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283731#comment-17283731 ] Truc Lam Nguyen commented on ARROW-11497: - [~apitrou] [~emkornfield] I think we can make a final decision on this. I'm ok with the option that end users have some level of control to preserve the behaviour. Please let me know your thoughts, thanks :) > [Python] pyarrow parquet writer for list does not conform with Apache Parquet > specification > --- > > Key: ARROW-11497 > URL: https://issues.apache.org/jira/browse/ARROW-11497 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: Truc Lam Nguyen >Priority: Major > Attachments: parquet-tools-meta.log > > > Sorry if this feature is done deliberately, but it looks like > the parquet writer for the list data type does not conform to the Apache Parquet list > logical type specification. > According to this page: > [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists,] > the list type contains 3 levels, where the middle level, named {{list}}, must be a > repeated group with a single field named _{{element}}_. > However, in the parquet file from the pyarrow writer, that single field is named > _item_ instead. > Please find below the example Python code that produces a parquet file (I use > pandas version 1.2.1 and pyarrow version 3.0.0):
{code:python}
import pandas as pd

df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
                                     {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
> Then I use parquet-tools from > [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the > parquet file via this command: > parquet-tools meta /tmp/test.parquet > The full metadata is included in the attachment; here is only an extract for the list-type column:
{noformat}
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
{noformat}
> As can be seen, under list, there is a single field named _item_. > I think this should be named _element_ to conform with the Apache > Parquet specification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
[ https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283766#comment-17283766 ] Andy Grove commented on ARROW-11606: I understand the issue better now. In the DataFusion planner, the aggregate expressions are compiled against the schema of the input to the partial aggregate. These compiled expressions are then used to construct both the partial and final aggregates. In other words, the expressions for the Final aggregate are not compiled against its input schema, but against the input schema of the Partial aggregate. This feels a little unnatural when implementing serde but I will think about this more and see how I can work around this. > [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction > - > > Key: ARROW-11606 > URL: https://issues.apache.org/jira/browse/ARROW-11606 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > We have run into an issue in the Ballista project where we are reconstructing > the Final and Partial HashAggregateExec operators [1] for distributed > execution and we need some guidance. > The Partial HashAggregateExec gets created OK and executes correctly. > However, when we create the Final HashAggregateExec, it is not finding the > expected schema in the input operator. The partial exec outputs field names > ending with "[sum]" and "[count]" and so on but the final aggregate doesn't > seem to be looking for those names. > It is also worth noting that the Final and Partial executors are not > connected directly in this usage. > The Partial exec is executed and output streamed to disk. > The Final exec then runs against the output from the Partial exec. > We may need to make changes in DataFusion to allow other crates to support > this kind of use case? 
> [1] https://github.com/ballista-compute/ballista/pull/491 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
[ https://issues.apache.org/jira/browse/ARROW-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-11601: - Description: This can help performance on high-latency filesystems. (was: This can help performance on high-latency filesystems. However, some care will be needed as then we won't be able to create one Arrow reader per Parquet row group anymore.) > [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions > - > > Key: ARROW-11601 > URL: https://issues.apache.org/jira/browse/ARROW-11601 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 3.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: dataset, datasets > > This can help performance on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
[ https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11606: --- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction > - > > Key: ARROW-11606 > URL: https://issues.apache.org/jira/browse/ARROW-11606 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We have run into an issue in the Ballista project where we are reconstructing > the Final and Partial HashAggregateExec operators [1] for distributed > execution and we need some guidance. > The Partial HashAggregateExec gets created OK and executes correctly. > However, when we create the Final HashAggregateExec, it is not finding the > expected schema in the input operator. The partial exec outputs field names > ending with "[sum]" and "[count]" and so on but the final aggregate doesn't > seem to be looking for those names. > It is also worth noting that the Final and Partial executors are not > connected directly in this usage. > The Partial exec is executed and output streamed to disk. > The Final exec then runs against the output from the Partial exec. > We may need to make changes in DataFusion to allow other crates to support > this kind of use case? > [1] https://github.com/ballista-compute/ballista/pull/491 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
[ https://issues.apache.org/jira/browse/ARROW-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11601: --- Labels: dataset datasets pull-request-available (was: dataset datasets) > [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions > - > > Key: ARROW-11601 > URL: https://issues.apache.org/jira/browse/ARROW-11601 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 3.0.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: dataset, datasets, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This can help performance on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11607) [Python] Error when reading table with list values from parquet
[ https://issues.apache.org/jira/browse/ARROW-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11607: -- Fix Version/s: 4.0.0 > [Python] Error when reading table with list values from parquet > --- > > Key: ARROW-11607 > URL: https://issues.apache.org/jira/browse/ARROW-11607 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 > Environment: Python 3.7 >Reporter: Michal Glaus >Priority: Major > Fix For: 4.0.0 > > > I'm getting unexpected results when reading tables containing list values and > a large number of rows from a parquet file. > Example code (pyarrow 2.0.0 and 3.0.0): > {code:java} > from pyarrow import parquet, Table > data = [None] * (1 << 20) > data.append([1]) > table = Table.from_arrays([data], ['column']) > print('Expected: %s' % table['column'][-1]) > parquet.write_table(table, 'table.parquet') > table2 = parquet.read_table('table.parquet') > print('Actual: %s' % table2['column'][-1]){code} > Output: > {noformat} > Expected: [1] > Actual: [0]{noformat} > When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get: > {noformat} > Expected: [1] > Actual: [1]{noformat} > For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15. > It seems that this is caused by some overflow and memory corruption because > in pyarrow 3.0.0 with more complex values (list of dictionaries with float > and datetime): > {noformat} > data.append([{'a': 0.1, 'b': datetime.now()}]) > {noformat} > I'm getting this exception after calling table2.to_pandas(): > {noformat} > /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create > default memory pool{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage
David Li created ARROW-11609: Summary: [C++][Docs] Trivial CMake dependency on Arrow fails at link stage Key: ARROW-11609 URL: https://issues.apache.org/jira/browse/ARROW-11609 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Affects Versions: 3.0.0 Reporter: David Li The example in the docs here isn't sufficient: [https://arrow.apache.org/docs/cpp/cmake.html] It fails at link time because Arrow's transitive dependencies aren't included in the INTERFACE_LINK_LIBRARIES: {noformat} /usr/bin/ld: warning: libglog.so.0, needed by /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using -rpath or -rpath-link) /usr/bin/ld: warning: libutf8proc.so.2, needed by /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using -rpath or -rpath-link) /usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using -rpath or -rpath-link) # ...{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage
[ https://issues.apache.org/jira/browse/ARROW-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-11609. Resolution: Not A Problem > [C++][Docs] Trivial CMake dependency on Arrow fails at link stage > - > > Key: ARROW-11609 > URL: https://issues.apache.org/jira/browse/ARROW-11609 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Affects Versions: 3.0.0 >Reporter: David Li >Priority: Major > > The example in the docs here isn't sufficient: > [https://arrow.apache.org/docs/cpp/cmake.html] > It fails at link time because Arrow's transitive dependencies aren't included > in the INTERFACE_LINK_LIBRARIES: > {noformat} > /usr/bin/ld: warning: libglog.so.0, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > /usr/bin/ld: warning: libutf8proc.so.2, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > /usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > # ...{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage
[ https://issues.apache.org/jira/browse/ARROW-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283901#comment-17283901 ] David Li commented on ARROW-11609: -- Ah, the actual issue here is needing rpath to contain the right directory. Including libutf8proc implicitly does that, but it seems ARROW-4065 intentionally removed the transitive dependencies from ArrowTargets.cmake. Instead downstream projects depending on Arrow can use {{target_link_directories(..., path/to/conda/env/lib)}} (it seems this is really only an issue when using Conda). Closing. > [C++][Docs] Trivial CMake dependency on Arrow fails at link stage > - > > Key: ARROW-11609 > URL: https://issues.apache.org/jira/browse/ARROW-11609 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Affects Versions: 3.0.0 >Reporter: David Li >Priority: Major > > The example in the docs here isn't sufficient: > [https://arrow.apache.org/docs/cpp/cmake.html] > It fails at link time because Arrow's transitive dependencies aren't included > in the INTERFACE_LINK_LIBRARIES: > {noformat} > /usr/bin/ld: warning: libglog.so.0, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > /usr/bin/ld: warning: libutf8proc.so.2, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > /usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by > /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try > using -rpath or -rpath-link) > # ...{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11610) [C++] Download boost from sourceforge instead of bintray
Neal Richardson created ARROW-11610: --- Summary: [C++] Download boost from sourceforge instead of bintray Key: ARROW-11610 URL: https://issues.apache.org/jira/browse/ARROW-11610 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 e.g. https://sourceforge.net/projects/boost/files/boost/1.67.0/boost_1_67_0.tar.gz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11611) [C++] Move third party dependency mirrors from bintray
Neal Richardson created ARROW-11611: --- Summary: [C++] Move third party dependency mirrors from bintray Key: ARROW-11611 URL: https://issues.apache.org/jira/browse/ARROW-11611 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 We added copies of these a while back to handle rate limiting to our own bintray. We should either remove them or update and move them elsewhere. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11612) [C++] Rebuild trimmed boost bundle
Neal Richardson created ARROW-11612: --- Summary: [C++] Rebuild trimmed boost bundle Key: ARROW-11612 URL: https://issues.apache.org/jira/browse/ARROW-11612 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 And host somewhere other than bintray. We can prune it further now that we've dropped boost::regex, too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11613) [R] Move nightly C++ builds off of bintray
Neal Richardson created ARROW-11613: --- Summary: [R] Move nightly C++ builds off of bintray Key: ARROW-11613 URL: https://issues.apache.org/jira/browse/ARROW-11613 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11499) [Packaging] Remove all use of bintray
[ https://issues.apache.org/jira/browse/ARROW-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-11499: Description: Bintray is being shut down on May 1. https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ I've made subtasks for the bintray usage other than the dl.bintray.com/apache/arrow repository we use for hosting release artifacts. was: Bintray is being shut down on May 1, and possibly as early as February 28 we won't be able to upload to it. https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ Feel free to make subtasks to break out this work. > [Packaging] Remove all use of bintray > - > > Key: ARROW-11499 > URL: https://issues.apache.org/jira/browse/ARROW-11499 > Project: Apache Arrow > Issue Type: New Feature > Components: Packaging >Reporter: Neal Richardson >Priority: Blocker > Fix For: 4.0.0 > > > Bintray is being shut down on May 1. > https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ > I've made subtasks for the bintray usage other than the > dl.bintray.com/apache/arrow repository we use for hosting release artifacts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11610) [C++] Download boost from sourceforge instead of bintray
[ https://issues.apache.org/jira/browse/ARROW-11610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11610: --- Labels: pull-request-available (was: ) > [C++] Download boost from sourceforge instead of bintray > > > Key: ARROW-11610 > URL: https://issues.apache.org/jira/browse/ARROW-11610 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > e.g. > https://sourceforge.net/projects/boost/files/boost/1.67.0/boost_1_67_0.tar.gz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors
[ https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-11611: Summary: [C++] Update third party dependency mirrors (was: [C++] Move third party dependency mirrors from bintray) > [C++] Update third party dependency mirrors > --- > > Key: ARROW-11611 > URL: https://issues.apache.org/jira/browse/ARROW-11611 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Neal Richardson >Priority: Major > Fix For: 4.0.0 > > > We added copies of these a while back to handle rate limiting to our own > bintray. We should either remove them or update and move them elsewhere. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-11611) [C++] Update third party dependency mirrors
[ https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-11611: --- Assignee: Ben Kietzman > [C++] Update third party dependency mirrors > --- > > Key: ARROW-11611 > URL: https://issues.apache.org/jira/browse/ARROW-11611 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Major > Fix For: 4.0.0 > > > We added copies of these a while back as GitHub releases to handle rate > limiting to our own bintray. We've since bumped our dependency versions but > didn't update our copies in these mirrors, so they're currently useless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors
[ https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-11611: Description: We added copies of these a while back as GitHub releases to handle rate limiting to our own bintray. We've since bumped our dependency versions but didn't update our copies in these mirrors, so they're currently useless. (was: We added copies of these a while back to handle rate limiting to our own bintray. We should either remove them or update and move them elsewhere.) > [C++] Update third party dependency mirrors > --- > > Key: ARROW-11611 > URL: https://issues.apache.org/jira/browse/ARROW-11611 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Neal Richardson >Priority: Major > Fix For: 4.0.0 > > > We added copies of these a while back as GitHub releases to handle rate > limiting to our own bintray. We've since bumped our dependency versions but > didn't update our copies in these mirrors, so they're currently useless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors
[ https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-11611: Parent: (was: ARROW-11499) Issue Type: Improvement (was: Sub-task) > [C++] Update third party dependency mirrors > --- > > Key: ARROW-11611 > URL: https://issues.apache.org/jira/browse/ARROW-11611 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Priority: Major > Fix For: 4.0.0 > > > We added copies of these a while back as GitHub releases to handle rate > limiting to our own bintray. We've since bumped our dependency versions but > didn't update our copies in these mirrors, so they're currently useless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11593) Parquet does not support wasm32-unknown-unknown target
[ https://issues.apache.org/jira/browse/ARROW-11593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284005#comment-17284005 ] Dominik Moritz commented on ARROW-11593: If lz4 is the issue, maybe we could switch to https://github.com/PSeitz/lz4_flex, which compiles to WASM. > Parquet does not support wasm32-unknown-unknown target > -- > > Key: ARROW-11593 > URL: https://issues.apache.org/jira/browse/ARROW-11593 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Dominik Moritz >Priority: Major > > The Arrow crate successfully compiles to WebAssembly (e.g. > https://github.com/domoritz/arrow-wasm) but the Parquet crate currently does > not support the`wasm32-unknown-unknown` target. > Try out the repository at > https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0. > The problem seems to be in liblz4, even if I do not include lz4 in the > feature flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11614) [C++][Gandiva] Fix round() logic to return positive zero when argument is zero
Sagnik Chakraborty created ARROW-11614: -- Summary: [C++][Gandiva] Fix round() logic to return positive zero when argument is zero Key: ARROW-11614 URL: https://issues.apache.org/jira/browse/ARROW-11614 Project: Apache Arrow Issue Type: Bug Reporter: Sagnik Chakraborty Previously, round(0.0) and round(0.0, out_scale) were returning -0.0; with this patch, round() returns +0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11614) [C++][Gandiva] Fix round() logic to return positive zero when argument is zero
[ https://issues.apache.org/jira/browse/ARROW-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11614: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Fix round() logic to return positive zero when argument is zero > -- > > Key: ARROW-11614 > URL: https://issues.apache.org/jira/browse/ARROW-11614 > Project: Apache Arrow > Issue Type: Bug >Reporter: Sagnik Chakraborty >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Previously, round(0.0) and round(0.0, out_scale) were returning -0.0; with > this patch, round() returns +0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
[ https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb reassigned ARROW-11606: --- Assignee: Andy Grove > [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction > - > > Key: ARROW-11606 > URL: https://issues.apache.org/jira/browse/ARROW-11606 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > We have run into an issue in the Ballista project where we are reconstructing > the Final and Partial HashAggregateExec operators [1] for distributed > execution and we need some guidance. > The Partial HashAggregateExec gets created OK and executes correctly. > However, when we create the Final HashAggregateExec, it is not finding the > expected schema in the input operator. The partial exec outputs field names > ending with "[sum]" and "[count]" and so on but the final aggregate doesn't > seem to be looking for those names. > It is also worth noting that the Final and Partial executors are not > connected directly in this usage. > The Partial exec is executed and output streamed to disk. > The Final exec then runs against the output from the Partial exec. > We may need to make changes in DataFusion to allow other crates to support > this kind of use case? > [1] https://github.com/ballista-compute/ballista/pull/491 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
[ https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11606. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9481 [https://github.com/apache/arrow/pull/9481] > [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction > - > > Key: ARROW-11606 > URL: https://issues.apache.org/jira/browse/ARROW-11606 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > We have run into an issue in the Ballista project where we are reconstructing > the Final and Partial HashAggregateExec operators [1] for distributed > execution and we need some guidance. > The Partial HashAggregateExec gets created OK and executes correctly. > However, when we create the Final HashAggregateExec, it is not finding the > expected schema in the input operator. The partial exec outputs field names > ending with "[sum]" and "[count]" and so on but the final aggregate doesn't > seem to be looking for those names. > It is also worth noting that the Final and Partial executors are not > connected directly in this usage. > The Partial exec is executed and output streamed to disk. > The Final exec then runs against the output from the Partial exec. > We may need to make changes in DataFusion to allow other crates to support > this kind of use case? > [1] https://github.com/ballista-compute/ballista/pull/491 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
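For context, the Partial/Final split discussed in ARROW-11606 is the standard two-phase aggregation pattern: the partial stage emits intermediate state columns (the "[sum]"/"[count]"-suffixed fields), and the final stage must consume exactly that schema, not the raw input. A toy sketch in plain Rust; the types and names below are illustrative only, not DataFusion's actual API:

```rust
// Toy two-phase average. The partial phase emits intermediate state
// (think "x[sum]" and "x[count]" columns); the final phase merges those
// states, so its input schema is the partial phase's *output* schema.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PartialAvg {
    sum: f64,
    count: u64,
}

// Partial stage: runs once per partition over raw values.
fn partial(values: &[f64]) -> PartialAvg {
    PartialAvg {
        sum: values.iter().sum(),
        count: values.len() as u64,
    }
}

// Final stage: consumes partial states (possibly read back from disk,
// as in the Ballista usage described above) and produces the result.
fn final_merge(parts: &[PartialAvg]) -> f64 {
    let sum: f64 = parts.iter().map(|p| p.sum).sum();
    let count: u64 = parts.iter().map(|p| p.count).sum();
    sum / count as f64
}

fn main() {
    // Two "partitions", e.g. two distributed tasks whose partial output
    // was streamed to disk before the final aggregate ran.
    let p1 = partial(&[1.0, 2.0, 3.0]);
    let p2 = partial(&[4.0]);
    assert_eq!(final_merge(&[p1, p2]), 2.5);
    println!("avg = {}", final_merge(&[p1, p2]));
}
```

The point of the sketch is the schema coupling: if the final stage looks for raw-value column names instead of the partial state names, the merge cannot be wired up, which is the mismatch the issue reports.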
[jira] [Resolved] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics
[ https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11586. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9479 [https://github.com/apache/arrow/pull/9479] > [Rust] [Datafusion] Invalid SQL sometimes panics > > > Key: ARROW-11586 > URL: https://issues.apache.org/jira/browse/ARROW-11586 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Executing the invalid SQL "select 1 order by x" will panic rather than returning > an Err: > ``` > thread '' panicked at 'called `Result::unwrap()` on an `Err` value: > Plan("Invalid identifier \'x\' for schema Int64(1)")', > /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76 > stack backtrace: >0: _rust_begin_unwind >1: core::panicking::panic_fmt >2: core::option::expect_none_failed >3: core::result::Result::unwrap >4: datafusion::sql::planner::SqlToRel::order_by::{{closure}} >5: core::iter::adapters::map_try_fold::{{closure}} >6: core::iter::traits::iterator::Iterator::try_fold >7: as > core::iter::traits::iterator::Iterator>::try_fold >8: as > core::iter::traits::iterator::Iterator>::try_fold >9: core::iter::traits::iterator::Iterator::find > 10: as > core::iter::traits::iterator::Iterator>::next > 11: as alloc::vec::SpecFromIterNested>::from_iter > 12: as alloc::vec::SpecFromIter>::from_iter > 13: as > core::iter::traits::collect::FromIterator>::from_iter > 14: core::iter::traits::iterator::Iterator::collect > 15: as > core::iter::traits::collect::FromIterator>>::from_iter::{{closure}} > 16: core::iter::adapters::process_results > 17: as > core::iter::traits::collect::FromIterator>>::from_iter > 18: core::iter::traits::iterator::Iterator::collect > 19: datafusion::sql::planner::SqlToRel::order_by >
20: datafusion::sql::planner::SqlToRel::query_to_plan > 21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan > 22: datafusion::sql::planner::SqlToRel::statement_to_plan > 23: datafusion::execution::context::ExecutionContext::create_logical_plan > ``` > This is happening because of an `unwrap` at > https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652. > > Perhaps the error should be returned as the Result rather than panicking, so > the error can be handled? There are a number of other places in the planner > where `unwrap()` is used, so they may warrant similar treatment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
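The resolution above replaces the planner's `unwrap()` with error propagation, so invalid SQL surfaces as an `Err` the caller can handle. A minimal sketch of that pattern, using toy stand-ins rather than DataFusion's actual planner types:

```rust
// Toy stand-in for the planner's identifier lookup; the function name,
// signature, and error type here are illustrative only.
fn find_column(schema: &[&str], ident: &str) -> Result<usize, String> {
    schema
        .iter()
        .position(|c| *c == ident)
        // Propagate a planner error instead of calling unwrap() and
        // panicking inside the library.
        .ok_or_else(|| format!("Invalid identifier '{}' for schema {:?}", ident, schema))
}

fn main() {
    let schema = ["c1", "c2"];
    assert_eq!(find_column(&schema, "c2"), Ok(1));
    // Analogous to "select 1 order by x": the unknown column now comes
    // back as an Err rather than a panic.
    assert!(find_column(&schema, "x").is_err());
    println!("ok");
}
```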
[jira] [Updated] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics
[ https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-11586: Component/s: Rust - DataFusion > [Rust] [Datafusion] Invalid SQL sometimes panics > > > Key: ARROW-11586 > URL: https://issues.apache.org/jira/browse/ARROW-11586 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Marc Prud'hommeaux >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Executing the invalid SQL "select 1 order by x" will panic rather than returning > an Err: > ``` > thread '' panicked at 'called `Result::unwrap()` on an `Err` value: > Plan("Invalid identifier \'x\' for schema Int64(1)")', > /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76 > stack backtrace: >0: _rust_begin_unwind >1: core::panicking::panic_fmt >2: core::option::expect_none_failed >3: core::result::Result::unwrap >4: datafusion::sql::planner::SqlToRel::order_by::{{closure}} >5: core::iter::adapters::map_try_fold::{{closure}} >6: core::iter::traits::iterator::Iterator::try_fold >7: as > core::iter::traits::iterator::Iterator>::try_fold >8: as > core::iter::traits::iterator::Iterator>::try_fold >9: core::iter::traits::iterator::Iterator::find > 10: as > core::iter::traits::iterator::Iterator>::next > 11: as alloc::vec::SpecFromIterNested>::from_iter > 12: as alloc::vec::SpecFromIter>::from_iter > 13: as > core::iter::traits::collect::FromIterator>::from_iter > 14: core::iter::traits::iterator::Iterator::collect > 15: as > core::iter::traits::collect::FromIterator>>::from_iter::{{closure}} > 16: core::iter::adapters::process_results > 17: as > core::iter::traits::collect::FromIterator>>::from_iter > 18: core::iter::traits::iterator::Iterator::collect > 19: datafusion::sql::planner::SqlToRel::order_by > 20: datafusion::sql::planner::SqlToRel::query_to_plan > 21:
datafusion::sql::planner::SqlToRel::sql_statement_to_plan > 22: datafusion::sql::planner::SqlToRel::statement_to_plan > 23: datafusion::execution::context::ExecutionContext::create_logical_plan > ``` > This is happening because of an `unwrap` at > https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652. > > Perhaps the error should be returned as the Result rather than panicking, so > the error can be handled? There are a number of other places in the planner > where `unwrap()` is used, so they may warrant similar treatment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza commented on ARROW-6154: --- I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. Unfortunately won't be able to upload file.* > {code:java} > stack backtrace: > 0: std::panicking::default_hook::{{closure}} > 1: std::panicking::default_hook > 2: std::panicking::rust_panic_with_hook > 3: std::panicking::continue_panic_fmt > 4: rust_begin_unwind > 5: core::panicking::panic_fmt > 6: core::result::unwrap_failed > 7: parquet::util::io::FileSource::new > 8: as > parquet::file::reader::RowGroupReader>::get_column_page_reader > 9: as > parquet::file::reader::RowGroupReader>::get_column_reader > 10: parquet::record::reader::TreeBuilder::reader_tree > 11: parquet::record::reader::TreeBuilder::reader_tree > 12: parquet::record::reader::TreeBuilder::reader_tree > 13: parquet::record::reader::TreeBuilder::reader_tree > 14: parquet::record::reader::TreeBuilder::reader_tree > 15: parquet::record::reader::TreeBuilder::build > 16: core::iter::traits::iterator::Iterator>::next > 17: parquet_read::main > 18: std::rt::lang_start::{{closure}} > 19: std::panicking::try::do_call > 20: __rust_maybe_catch_panic > 21: std::rt::lang_start_internal > 22: main{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:02 PM: -- I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`: {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/portfolio.parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. 
Unfortunately won't be able to upload file.* > {code:java} > stack backtrace: > 0: std::panicking::default_hook::{{closure}} > 1: std::panicking::default_hook > 2: std::panicking::rust_panic_with_hook > 3: std::panicking::continue_panic_fmt > 4: rust_begin_unwind > 5: core::panicking::panic_fmt > 6: core::result::unwrap_failed > 7: parquet::util::io::FileSource::new > 8: as > parquet::file::reader::RowGroupReader>::get_column_page_reader > 9: as > parquet::file::reader::RowGroupReader>::get_column_reader > 10: parquet::record::reader::TreeBuilder::reader_tree > 11: parquet::record::reader::TreeBuilder::reader_tree > 12: parquet::record::reader::TreeBuilder::reader_tree > 13: parquet::record::reader::TreeBuilder::reader_tree > 14: parquet::record::reader::TreeBuilder::reader_tree > 15: parquet::record::reader::TreeBuilder::build > 16: core::iter::traits::iterator::Iterator>::next > 17: parquet_read::main > 18: std::rt::lang_start::{{closure}} > 19: std::panicking::try::do_call > 20: __rust_maybe_catch_panic > 21: std::rt::lang_start_internal > 22: main{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
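The failure mode in this issue, one `try_clone()` per column reader, can be reproduced in miniature with `std::fs` alone: each duplicated descriptor counts against the process's open-file limit until it is dropped. A hedged sketch (illustrative only, not the parquet crate's code):

```rust
use std::fs::File;
use std::io::Write;

// Duplicate `n` descriptors of one file and hold them all, the way a
// reader that clones the file per column would.
fn clone_handles(f: &File, n: usize) -> std::io::Result<Vec<File>> {
    (0..n).map(|_| f.try_clone()).collect()
}

fn main() -> std::io::Result<()> {
    // Hypothetical temp file for the demo.
    let path = std::env::temp_dir().join("arrow6154_fd_demo.txt");
    File::create(&path)?.write_all(b"data")?;

    let f = File::open(&path)?;
    // With ~3000 columns, 3000 simultaneous clones can exceed a default
    // 1024-descriptor limit, producing "Too many open files (os error 24)".
    // We hold only 8 here so the demo itself stays safe to run.
    let clones = clone_handles(&f, 8)?;
    assert_eq!(clones.len(), 8);
    println!("held {} duplicated descriptors", clones.len());

    drop(clones); // descriptors are released only when the clones drop
    std::fs::remove_file(&path)?;
    Ok(())
}
```

A fix along these lines would share one handle (e.g. behind `Arc`) and seek per read, or raise the descriptor limit, rather than cloning per column.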
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:03 PM: -- I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`: {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`: {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/portfolio.parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. Unfortunately won't be able to upload file.* > {code:java} > stack backtrace
[jira] [Updated] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Riza updated ARROW-6154: -- Attachment: part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > Attachments: > part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet > > > Used [rust]*parquet-read binary to read a deeply nested parquet file and see > the below stack trace. Unfortunately won't be able to upload file.* > {code:java} > stack backtrace: > 0: std::panicking::default_hook::{{closure}} > 1: std::panicking::default_hook > 2: std::panicking::rust_panic_with_hook > 3: std::panicking::continue_panic_fmt > 4: rust_begin_unwind > 5: core::panicking::panic_fmt > 6: core::result::unwrap_failed > 7: parquet::util::io::FileSource::new > 8: as > parquet::file::reader::RowGroupReader>::get_column_page_reader > 9: as > parquet::file::reader::RowGroupReader>::get_column_reader > 10: parquet::record::reader::TreeBuilder::reader_tree > 11: parquet::record::reader::TreeBuilder::reader_tree > 12: parquet::record::reader::TreeBuilder::reader_tree > 13: parquet::record::reader::TreeBuilder::reader_tree > 14: parquet::record::reader::TreeBuilder::reader_tree > 15: parquet::record::reader::TreeBuilder::build > 16: core::iter::traits::iterator::Iterator>::next > 17: parquet_read::main > 18: std::rt::lang_start::{{closure}} > 19: std::panicking::try::do_call > 20: __rust_maybe_catch_panic > 21: std::rt::lang_start_internal > 22: main{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:06 PM: -- I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`: {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Major > Attachments: > part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet > > > Used [rust]*parquet-read binary
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:08 PM: -- I've come across the same issue. It appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors (I'm running this on Linux). Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` call here eventually fails as there are too many file handles open. Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/ARROW-6154 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Yesh >Priority: Ma
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:09 PM: -- I've come across the same issue. It appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. In my case I have a Parquet file with 3000 columns, and the `try_clone` calls here eventually fail as they end up creating too many open file descriptors (I'm running this on Linux, Fedora release 33, with rustc 1.50.0 (cb75ad5db 2021-02-10)). Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors (I'm running this on Linux). Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} > [Rust] [Parquet] Too many open files (os error 24) > -- > > Key: ARROW-6154 > URL: https://issues.apache.org/jira/browse/
[jira] [Commented] (ARROW-9392) [C++] Document more of the compute layer
[ https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284029#comment-17284029 ] Aldrin commented on ARROW-9392: --- Hello [~apitrou]! I am interested in helping out with this, at *least* with the portions that I will be using significantly in the near future. I'm not sure there's much to do here, but I have just had trouble finding documentation myself and wanted to volunteer to contribute. (I posted to the mailing list in case a lot of this already exists: https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E) For reference, my usage of the APIs will be related to ARROW-10549, but with a different end goal. > [C++] Document more of the compute layer > > > Key: ARROW-9392 > URL: https://issues.apache.org/jira/browse/ARROW-9392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Antoine Pitrou >Priority: Major > Fix For: 4.0.0 > > > Ideally, we should add: > * a description and examples of how to call compute functions > * an API reference for concrete C++ functions such as {{Cast}}, > {{NthToIndices}}, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9392) [C++] Document more of the compute layer
[ https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284029#comment-17284029 ] Aldrin edited comment on ARROW-9392 at 2/12/21, 11:14 PM: -- Hello [~apitrou]! I am interested in helping out with this, at *least* with the portions that I will be using significantly in the near future. I figured pinging you first to orient myself made sense since you created this issue. I'm not sure how much there is to do here, but I have just had trouble finding documentation myself and wanted to volunteer to contribute. (I posted to the mailing list in case a lot of this already exists: [https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E]) For reference, my usage of the APIs will be related to ARROW-10549, but with a different end goal. Thanks! was (Author: octalene): Hello [~apitrou]! I am interested in helping out with this, at *least* with the portions that I will be using significantly in the near future. I'm not sure there's much to do here, but I have just had trouble finding documentation myself and wanted to volunteer to contribute. (I posted to the mailing list in case a lot of this already exists: https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E) For reference, my usage of the APIs will be related to ARROW-10549, but with a different end goal. > [C++] Document more of the compute layer > > > Key: ARROW-9392 > URL: https://issues.apache.org/jira/browse/ARROW-9392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Antoine Pitrou >Priority: Major > Fix For: 4.0.0 > > > Ideally, we should add: > * a description and examples of how to call compute functions > * an API reference for concrete C++ functions such as {{Cast}}, > {{NthToIndices}}, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
[ https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiyang Zhao updated ARROW-11566: - Description: I created the pypi condition package to allow user-friendly expression of conditions. For example, a condition can be written as: (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'] where A, B, C are partition keys. For usage details, please see its documentation at: [https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#] Arbitrary condition objects can be converted to pyarrow's filter by calling its to_pyarrow_filter() method: [https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering] The above method will normalize the condition to conform to the pyarrow filter specification. Furthermore, the condition object can be used directly to evaluate partition paths. This can replace the current complex filtering code (both native and Python). For maximum efficiency, filtering with the condition object can be done as follows: # read the paths in chunks to keep the memory footprint small; # parse the paths into a pandas dataframe; # use condition.query(dataframe) to get the filtered dataframe of paths; # use the numexpr backend for the dataframe query for efficiency. Please discuss. was: I created the pypi condition package to allow user friendly expression of conditions. For example, a condition can be: (A <= 3 or B != 'b1') and C == ['c1', 'c2'] For usage details, please see its document at: [https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#] Arbitrary condition objects can be converted to pyarrow's filter by calling its to_pyarrow_filter() method: [https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering] The above method will normalize the condition to conform to pyarrow filter specification. 
Furthermore, the condition object can be used directly to evaluate partition paths. This can replace the current complex filtering code (both native and Python). For maximum efficiency, filtering with the condition object can be done as follows: # read the paths in chunks to keep the memory footprint small; # parse the paths into a pandas dataframe; # use condition.query(dataframe) to get the filtered dataframe of paths; # use the numexpr backend for the dataframe query for efficiency. Please discuss. > [Python][Parquet] Use pypi condition package to filter partitions in a > user-friendly way > > > Key: ARROW-11566 > URL: https://issues.apache.org/jira/browse/ARROW-11566 > Project: Apache Arrow > Issue Type: Improvement > Components: Python > Reporter: Weiyang Zhao > Assignee: Weiyang Zhao > Priority: Major > > I created the pypi condition package to allow user-friendly expression of > conditions. For example, a condition can be written as: > (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'] > where A, B, C are partition keys. > For usage details, please see its documentation at: > [https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#] > > Arbitrary condition objects can be converted to pyarrow's filter by calling > its to_pyarrow_filter() method: > [https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering] > The above method will normalize the condition to conform to the pyarrow > filter specification. 
> Please discuss.
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:29 PM: -- I've come across the same issue. It appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. In my case I have a Parquet file with 3000 columns, and the `try_clone` calls here eventually fail as they end up creating too many open file descriptors (I'm running this on Linux, Fedora release 33, with rustc 1.50.0 (cb75ad5db 2021-02-10)). Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. One could increase `ulimit -n` on Linux to work around this, but that is not really a solution, since the code path still ends up creating a potentially very large number of open file descriptors. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code} was (Author: dr.r...@gmail.com): I've come across the same issue. It appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] In my case I have a Parquet file with 3000 columns, and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors (I'm running this on Linux, {color:#00}Fedora release 33 and rustc 1.50.0 (cb75ad5db 2021-02-10)).{color} Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_file
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:31 PM: -- I've come across the same error. In my case it appears to be caused by the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. I have a Parquet file with 3000 columns, and the `try_clone` calls eventually fail because they end up creating too many open file descriptors. Here's a stack trace from `gdb` that leads to the call in `io.rs`; the problem can be reproduced with the attached Parquet file. One could raise `ulimit -n` on Linux to get around this, but that is not really a solution, since this code path can still create a very large number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
#2  0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
#3  0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
#7  0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files ()
    at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11615) DataFusion does not support wasm32-unknown-unknown target
Dominik Moritz created ARROW-11615: -- Summary: DataFusion does not support wasm32-unknown-unknown target Key: ARROW-11615 URL: https://issues.apache.org/jira/browse/ARROW-11615 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Dominik Moritz The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm), but the DataFusion crate currently does not support the `wasm32-unknown-unknown` target. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ] Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:59 PM: -- I've come across the same error. In my case it appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.] I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors{color:#00}.{color} Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. One could increase the `ulimit -n` on Linux to get around this, but not really a solution, since the code path ends up just creating potentially a very large number of open file descriptors (one for each column in the Parquet file). This is the initial stack trace when the footer is first read. `FileSource::new` (in io.rs) gets called for every column subsequently as well when reading the columns (see {color:#cc844f}fn {color}{color:#8ec1ff}reader_tree {color}in `parquet/record/reader.rs`) {code:java} #0 parquet::util::io::FileSource::new (fd=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82 #1 0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x77c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59 #2 0x5590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x77c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57 #3 0x55845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) 
at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134 #4 0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81 #5 0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90 #6 0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98 #7 0x5577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103 {code}
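To illustrate the failure mode outside the parquet crate: the sketch below is a hypothetical stand-in that uses Python's `os.dup` in place of Rust's `File::try_clone` (both duplicate the underlying descriptor). It lowers the soft descriptor limit so the one-dup-per-column pattern fails quickly, just as reading a 3000-column file does under the default `ulimit -n`.

```python
import os
import resource
import tempfile

# Lower the soft fd limit so the failure reproduces quickly (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

dup_count = 0
error = None
with tempfile.NamedTemporaryFile() as f:
    dups = []
    try:
        # One dup per "column", mirroring the per-column try_clone
        # in FileSource::new.
        for _ in range(3000):
            dups.append(os.dup(f.fileno()))
            dup_count += 1
    except OSError as e:
        error = e  # EMFILE: "Too many open files"
    finally:
        for fd in dups:
            os.close(fd)

resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
print(f"duplicated {dup_count} descriptors before failing: {error}")
```

Sharing a single handle (or memory-mapping the file) instead of cloning per column avoids the linear growth in open descriptors.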
[jira] [Updated] (ARROW-11615) DataFusion does not support wasm32-unknown-unknown target
[ https://issues.apache.org/jira/browse/ARROW-11615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Moritz updated ARROW-11615: --- Description: The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does not support the`wasm32-unknown-unknown` target. Try out the repository at https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a. {code} error[E0433]: failed to resolve: could not find `unix` in `os` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18 | 41 | use std::os::unix::ffi::OsStringExt; | could not find `unix` in `os` error[E0432]: unresolved import `unix` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5 | 6 | use unix; | no `unix` in the root error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9 | 98 | sys::duplicate(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9 | 101 | sys::allocated_size(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9 | 104 | sys::allocate(self, len) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9 | 107 | sys::lock_shared(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9 | 110 | sys::lock_exclusive(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9 | 113 | sys::try_lock_shared(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9 | 116 | sys::try_lock_exclusive(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9 | 119 | sys::unlock(self) | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5 | 126 | sys::lock_error() | ^^^ use of undeclared crate or module `sys` error[E0433]: failed to resolve: use of undeclared crate or module `sys` --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5 | 169 | sys::statvfs(path.as_ref()) | ^^^ use of undeclared crate or module `sys` Compiling num-rational v0.3.2 error: aborting due to 10 previous errors {code} was:The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does not support the`wasm32-unknown-unknown` target. > DataFusion does not support wasm32-unknown-unknown target > - > > Key: ARROW-11615 > URL: https://issues.apache.org/jira/browse/ARROW-11615 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Dominik Moritz >Priority: Major > > The Arrow crate successfully compiles to WebAssembly (e.g. 
> https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently > does not support the `wasm32-unknown-unknown` target. > Try out the repository at > https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a. > > {code} > error[E0433]: failed to resolve: could not find `unix` in `os` > --> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18 >| > 41 | use std::os::unix::ffi::OsStringExt; >| could no
[jira] [Updated] (ARROW-11593) [Rust] Parquet does not support wasm32-unknown-unknown target
[ https://issues.apache.org/jira/browse/ARROW-11593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Moritz updated ARROW-11593: --- Summary: [Rust] Parquet does not support wasm32-unknown-unknown target (was: Parquet does not support wasm32-unknown-unknown target) > [Rust] Parquet does not support wasm32-unknown-unknown target > - > > Key: ARROW-11593 > URL: https://issues.apache.org/jira/browse/ARROW-11593 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Dominik Moritz >Priority: Major > > The Arrow crate successfully compiles to WebAssembly (e.g. > https://github.com/domoritz/arrow-wasm) but the Parquet crate currently does > not support the `wasm32-unknown-unknown` target. > Try out the repository at > https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0. > The problem seems to be in liblz4, even if I do not include lz4 in the > feature flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11615) [Rust] DataFusion does not support wasm32-unknown-unknown target
[ https://issues.apache.org/jira/browse/ARROW-11615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Moritz updated ARROW-11615: --- Summary: [Rust] DataFusion does not support wasm32-unknown-unknown target (was: DataFusion does not support wasm32-unknown-unknown target) > [Rust] DataFusion does not support wasm32-unknown-unknown target > > > Key: ARROW-11615 > URL: https://issues.apache.org/jira/browse/ARROW-11615 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Dominik Moritz >Priority: Major > > The Arrow crate successfully compiles to WebAssembly (e.g. > https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently > does not support the`wasm32-unknown-unknown` target. > Try out the repository at > https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a. > > {code} > error[E0433]: failed to resolve: could not find `unix` in `os` > --> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18 >| > 41 | use std::os::unix::ffi::OsStringExt; >| could not find `unix` in `os` > error[E0432]: unresolved import `unix` > --> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5 > | > 6 | use unix; > | no `unix` in the root > error[E0433]: failed to resolve: use of undeclared crate or module `sys` > --> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9 >| > 98 | sys::duplicate(self) >| ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9 > | > 101 | sys::allocated_size(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9 > | > 104 | sys::allocate(self, len) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9 > | > 107 | sys::lock_shared(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9 > | > 110 | sys::lock_exclusive(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9 > | > 113 | sys::try_lock_shared(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9 > | > 116 | sys::try_lock_exclusive(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9 > | > 119 | sys::unlock(self) > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5 > | > 126 | sys::lock_error() > | ^^^ use of undeclared crate or module `sys` > error[E0433]: failed to resolve: use of undeclared crate or module `sys` >--> > /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5 > | > 169 | sys::statvfs(path.as_ref()) > | ^^^ use of undeclared crate or module `sys` >Compiling num-rational v0.3.2 
> error: aborting due to 10 previous errors > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
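Not part of the report above, but for context: the failing crates (`dirs`, `fs2`) are filesystem-bound, and a common Cargo pattern for keeping a build working on `wasm32-unknown-unknown` is to gate such dependencies behind a target cfg so the wasm build never compiles their unix-specific code. A hypothetical sketch (illustrating the mechanism, not a fix for DataFusion itself):

```toml
# Hypothetical Cargo.toml arrangement: native-only crates are pulled in
# only off-wasm, so the wasm32 target never sees std::os::unix usage.
[target.'cfg(not(target_arch = "wasm32"))'.dependencies]
fs2 = "0.4"
dirs = "1.0"
```

Since these are transitive dependencies here, the gating (or an equivalent feature flag) would have to live in the crates that depend on them, not in DataFusion's downstream users.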
[jira] [Created] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame
Mike Seddon created ARROW-11616: --- Summary: [Rust][DataFusion] Expose collect_partitioned for DataFrame Key: ARROW-11616 URL: https://issues.apache.org/jira/browse/ARROW-11616 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon The DataFrame API has a `collect` method which invokes the `collect(plan: Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects records into a single vector of RecordBatches, removing the partitioning via `MergeExec`. The DataFrame should also expose the `collect_partitioned` method so that partitions can be maintained. ``` collect_partitioned( plan: Arc<dyn ExecutionPlan>, ) -> Result<Vec<Vec<RecordBatch>>> ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
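The relationship between the two methods can be sketched with plain lists standing in for `RecordBatch`es (a toy model, not the DataFusion API): the outer list plays the role of partitions, and `collect` is the flattened form of `collect_partitioned`.

```python
from itertools import chain

# Toy stand-ins: each inner list models one partition's Vec<RecordBatch>;
# integers model RecordBatches.
partitions = [[1, 2], [3], [4, 5]]

def collect_partitioned(parts):
    # Keeps partition boundaries, like the proposed DataFrame method.
    return parts

def collect(parts):
    # Merges all partitions into one flat sequence, as MergeExec does.
    return list(chain.from_iterable(parts))

assert len(collect_partitioned(partitions)) == 3   # boundaries kept
assert collect(partitions) == [1, 2, 3, 4, 5]      # boundaries lost
print("partitioned:", collect_partitioned(partitions))
print("merged:", collect(partitions))
```

Keeping the partitioned form matters when the caller wants to continue processing per-partition (e.g. writing one output file per partition) without paying for a merge.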
[jira] [Updated] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame
[ https://issues.apache.org/jira/browse/ARROW-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11616: --- Labels: pull-request-available (was: ) > [Rust][DataFusion] Expose collect_partitioned for DataFrame > --- > > Key: ARROW-11616 > URL: https://issues.apache.org/jira/browse/ARROW-11616 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Mike Seddon >Assignee: Mike Seddon >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The DataFrame API has a `collect` method which invokes the `collect(plan: > Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects > records into a single vector of RecordBatches, removing the > partitioning via `MergeExec`. > The DataFrame should also expose the `collect_partitioned` method so that > partitions can be maintained. > ``` > collect_partitioned( > plan: Arc<dyn ExecutionPlan>, > ) -> Result<Vec<Vec<RecordBatch>>> > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284083#comment-17284083 ] Micah Kornfield commented on ARROW-11497: - My thought: I think in the short term we can expose the flag. We can figure out a longer term plan for migrating all users to a conformant writer/reader. [~trucnguyenlam] do you want to provide a PR? > [Python] pyarrow parquet writer for list does not conform with Apache Parquet > specification > --- > > Key: ARROW-11497 > URL: https://issues.apache.org/jira/browse/ARROW-11497 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 >Reporter: Truc Lam Nguyen >Priority: Major > Attachments: parquet-tools-meta.log > > > Apologies if this behaviour is deliberate, but it looks like > the parquet writer for the list data type does not conform to the Apache Parquet list > logical type specification. > According to this page: > [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists,] > a list type contains 3 levels, where the middle level, named {{list}}, must be a > repeated group with a single field named _{{element}}_. > However, in the parquet file from the pyarrow writer, that single field is named > _item_ instead. > Please find below the example python code that produces a parquet file (I use > pandas version 1.2.1 and pyarrow version 3.0.0) > {code:python} > import pandas as pd > > df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', > 'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', > 'games': [{'name': 'fifa', 'version': '21'}]}, ]) > df.to_parquet('/tmp/test.parquet', engine='pyarrow') > {code} > Then I use parquet-tools from > [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the > parquet file via this command > parquet-tools meta /tmp/test.parquet > The full meta is included in the attachment; here is only an extraction of the list > type column
> games: OPTIONAL F:1 > .list: REPEATED F:1 > ..item: OPTIONAL F:2 > ...name: OPTIONAL BINARY L:STRING R:1 D:4 > ...version: OPTIONAL BINARY L:STRING R:1 D:4 > as can be seen, under list there is a single field named _item_ > I think this should be named _element_ to conform with the Apache > Parquet specification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11617) [C++][Gandiva] Fix nested if-else optimisation in gandiva
Projjal Chanda created ARROW-11617: -- Summary: [C++][Gandiva] Fix nested if-else optimisation in gandiva Key: ARROW-11617 URL: https://issues.apache.org/jira/browse/ARROW-11617 Project: Apache Arrow Issue Type: Bug Reporter: Projjal Chanda Assignee: Projjal Chanda In gandiva, when we have nested if-else statements we reuse the local bitmap and treat it as a single logical if - elseif - ... - else condition. However, when we have, say, another function between them, like IF THEN ELSE function( IF THEN ELSE ) in such cases we currently do the same thing, which can lead to incorrect results. -- This message was sent by Atlassian Jira (v8.3.4#803005)
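A toy model of the hazard (not Gandiva's actual codegen): the local bitmap marks rows whose value was already fixed by an earlier branch of the *same* if/elseif chain, so later branches skip them. If the nested chain inside `function(...)` wrongly reuses the outer chain's bitmap, it skips rows the outer branch marked and produces wrong values:

```python
# Toy model of the local-bitmap optimisation.
def if_else(cond, then_vals, else_vals, bitmap):
    out = []
    for i, c in enumerate(cond):
        if bitmap[i]:
            out.append(None)          # row already decided earlier in chain
        elif c:
            bitmap[i] = True          # this branch resolves row i
            out.append(then_vals[i])
        else:
            out.append(else_vals[i])
    return out

rows = [True, False]

# Correct: the nested IF is an independent chain, so it gets a fresh bitmap.
inner = if_else(rows, ["a", "a"], ["b", "b"], [False, False])
assert inner == ["a", "b"]

# Buggy: reusing the outer chain's bitmap makes the nested IF skip row 0,
# which the outer THEN branch had already marked.
outer_bitmap = [True, False]
inner_buggy = if_else(rows, ["a", "a"], ["b", "b"], outer_bitmap)
assert inner_buggy == [None, "b"]

print("fresh bitmap:", inner, "| reused bitmap:", inner_buggy)
```

The fix is to treat an if-chain that appears inside a function argument as a new chain with its own bitmap, rather than folding it into the enclosing if/elseif bookkeeping.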
[jira] [Updated] (ARROW-11617) [C++][Gandiva] Fix nested if-else optimisation in gandiva
[ https://issues.apache.org/jira/browse/ARROW-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11617: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Fix nested if-else optimisation in gandiva > - > > Key: ARROW-11617 > URL: https://issues.apache.org/jira/browse/ARROW-11617 > Project: Apache Arrow > Issue Type: Bug >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In gandiva, when we have nested if-else statements we reuse the local bitmap > and treat it as a single logical if - elseif - ... - else condition. However, > when we have, say, another function between them, like > IF > THEN > ELSE > function( > IF > THEN > ELSE > ) > in such cases we currently do the same thing, which can > lead to incorrect results. -- This message was sent by Atlassian Jira (v8.3.4#803005)