[jira] [Resolved] (ARROW-11563) [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11563.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9449
[https://github.com/apache/arrow/pull/9449]

> [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))
> ---
>
> Key: ARROW-11563
> URL: https://issues.apache.org/jira/browse/ARROW-11563
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Patsura Dmitry
>Assignee: Patsura Dmitry
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11557) [Rust] Add table de-registration to DataFusion ExecutionContext

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-11557:

Component/s: Rust

> [Rust] Add table de-registration to DataFusion ExecutionContext
> ---
>
> Key: ARROW-11557
> URL: https://issues.apache.org/jira/browse/ARROW-11557
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Table de-registration, as discussed at 
> https://lists.apache.org/thread.html/r0b3bc62a720c204c5bbe26d8157963276f7d61c05fcbad7eaf2ae9ff%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11557) [Rust] Add table de-registration to DataFusion ExecutionContext

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11557.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9445
[https://github.com/apache/arrow/pull/9445]

> [Rust] Add table de-registration to DataFusion ExecutionContext
> ---
>
> Key: ARROW-11557
> URL: https://issues.apache.org/jira/browse/ARROW-11557
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Table de-registration, as discussed at 
> https://lists.apache.org/thread.html/r0b3bc62a720c204c5bbe26d8157963276f7d61c05fcbad7eaf2ae9ff%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11607) [Python] Error when reading table with list values from parquet

2021-02-12 Thread Michal Glaus (Jira)
Michal Glaus created ARROW-11607:


 Summary: [Python] Error when reading table with list values from 
parquet
 Key: ARROW-11607
 URL: https://issues.apache.org/jira/browse/ARROW-11607
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0, 2.0.0, 1.0.1, 1.0.0
 Environment: Python 3.7
Reporter: Michal Glaus


I'm getting unexpected results when reading tables containing list values and a 
large number of rows from a parquet file.

Example code (pyarrow 2.0.0 and 3.0.0):
{code:java}
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1]){code}
Output:
{noformat}
Expected: [1]
Actual:   [0]{noformat}
When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
{noformat}
Expected: [1]
Actual:   [1]{noformat}

For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.

It seems that this is caused by some overflow and memory corruption because in 
pyarrow 3.0.0 with more complex values (list of dictionaries with float and 
datetime):
{noformat}
data.append([{'a': 0.1, 'b': datetime.now()}])
{noformat}
I'm getting this exception after calling table2.to_pandas():
{noformat}
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default 
memory pool{noformat}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11539) [Developer][Archery] Change items_per_seconds units

2021-02-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-11539.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9433
[https://github.com/apache/arrow/pull/9433]

> [Developer][Archery] Change items_per_seconds units
> ---
>
> Key: ARROW-11539
> URL: https://issues.apache.org/jira/browse/ARROW-11539
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Developer Tools
>Reporter: Diana Clarke
>Assignee: Diana Clarke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Antoine requested that I change the units in {{items_per_seconds_fmt}} to be:
> - K items/sec
> - M items/sec
> - G items/sec
> Rather than:
> - k items/sec
> - m items/sec
> - b items/sec
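
For illustration, a minimal sketch of a formatter using the requested K/M/G suffixes. The function name, thresholds and formatting below are assumptions for this example only, not Archery's actual {{items_per_seconds_fmt}} implementation:

{code:python}
def format_items_per_second(rate: float) -> str:
    """Format an items-per-second rate with K/M/G suffixes (illustrative only)."""
    if rate >= 1e9:
        return '%.3f G items/sec' % (rate / 1e9)
    if rate >= 1e6:
        return '%.3f M items/sec' % (rate / 1e6)
    if rate >= 1e3:
        return '%.3f K items/sec' % (rate / 1e3)
    return '%.3f items/sec' % rate


print(format_items_per_second(2_500_000))  # 2.500 M items/sec
{code}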



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11608) [CI] turbodbc integration tests are failing (build issue)

2021-02-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-11608:
-

 Summary: [CI] turbodbc integration tests are failing (build issue)
 Key: ARROW-11608
 URL: https://issues.apache.org/jira/browse/ARROW-11608
 Project: Apache Arrow
  Issue Type: Improvement
  Components: CI
Reporter: Joris Van den Bossche


Both turbodbc builds are failing, see e.g. 
https://github.com/ursacomputing/crossbow/runs/1885201762

It seems to be a failure to build turbodbc: 

{code}
/build/turbodbc /
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: 
/opt/conda/envs/arrow/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: Debug
CMake Error at CMakeLists.txt:14 (add_subdirectory):
  add_subdirectory given source "pybind11" which is not an existing
  directory.


-- Found GTest: /opt/conda/envs/arrow/lib/libgtest.so  
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
components: locale 
-- Detecting unixODBC library
--   Found header files at: /opt/conda/envs/arrow/include
--   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
components: system date_time locale 
-- Detecting unixODBC library
--   Found header files at: /opt/conda/envs/arrow/include
--   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
-- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
components: system 
-- Detecting unixODBC library
--   Found header files at: /opt/conda/envs/arrow/include
--   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
CMake Error at cpp/turbodbc_python/Library/CMakeLists.txt:3 
(pybind11_add_module):
  Unknown CMake command "pybind11_add_module".


-- Configuring incomplete, errors occurred!
See also "/build/turbodbc/CMakeFiles/CMakeOutput.log".
See also "/build/turbodbc/CMakeFiles/CMakeError.log".
1
Error: `docker-compose --file 
/home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e 
SETUPTOOLS_SCM_PRETEND_VERSION=3.1.0.dev174 conda-python-turbodbc` exited with 
a non-zero exit code 1, see the process log above.
{code}

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11586:
---
Labels: pull-request-available  (was: )

> [Rust] [Datafusion] Invalid SQL sometimes panics
> 
>
> Key: ARROW-11586
> URL: https://issues.apache.org/jira/browse/ARROW-11586
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Executing the invalid SQL "select 1 order by x" will panic rather than returning 
> an Err:
>  ```
> thread '' panicked at 'called `Result::unwrap()` on an `Err` value: 
> Plan("Invalid identifier \'x\' for schema Int64(1)")', 
> /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76
> stack backtrace:
>0: _rust_begin_unwind
>1: core::panicking::panic_fmt
>2: core::option::expect_none_failed
>3: core::result::Result::unwrap
>4: datafusion::sql::planner::SqlToRel::order_by::{{closure}}
>5: core::iter::adapters::map_try_fold::{{closure}}
>6: core::iter::traits::iterator::Iterator::try_fold
>7:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>8:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>9: core::iter::traits::iterator::Iterator::find
>   10:  as 
> core::iter::traits::iterator::Iterator>::next
>   11:  as alloc::vec::SpecFromIterNested>::from_iter
>   12:  as alloc::vec::SpecFromIter>::from_iter
>   13:  as 
> core::iter::traits::collect::FromIterator>::from_iter
>   14: core::iter::traits::iterator::Iterator::collect
>   15:  as 
> core::iter::traits::collect::FromIterator>>::from_iter::{{closure}}
>   16: core::iter::adapters::process_results
>   17:  as 
> core::iter::traits::collect::FromIterator>>::from_iter
>   18: core::iter::traits::iterator::Iterator::collect
>   19: datafusion::sql::planner::SqlToRel::order_by
>   20: datafusion::sql::planner::SqlToRel::query_to_plan
>   21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan
>   22: datafusion::sql::planner::SqlToRel::statement_to_plan
>   23: datafusion::execution::context::ExecutionContext::create_logical_plan
> ```
> This is happening because of an `unwrap` at 
> https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652.
>  
> Perhaps the error should be returned as the Result rather than panicking, so 
> the error can be handled? There are a number of other places in the planner 
> where `unwrap()` is used, so they may warrant similar treatment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-12 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283697#comment-17283697
 ] 

Joris Van den Bossche commented on ARROW-11456:
---

bq. Note that you may be able to do the conversion manually and force an Arrow 
large_string type, though I'm not sure Pandas allows that. 

Yes, pandas allows that by specifying a pyarrow schema manually (instead of 
letting pyarrow infer that from the dataframe).

For the example above, that would look like:

{code}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, 
schema=pa.schema([("s", pa.large_string())]))
{code}
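
As a quick check (not part of the original comment, and assuming the same {{out}} path as above), the schema actually stored in the file can be inspected afterwards; the column should then report {{large_string}} instead of {{string}}:

{code:python}
import pyarrow.parquet as pq

# Read back only the Arrow schema embedded in the Parquet file.
print(pq.read_schema(out))
{code}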

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64-encoded string of length 24. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-12 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283697#comment-17283697
 ] 

Joris Van den Bossche edited comment on ARROW-11456 at 2/12/21, 1:44 PM:
-

bq. Note that you may be able to do the conversion manually and force an Arrow 
large_string type, though I'm not sure Pandas allows that. 

Yes, pandas allows that by specifying a pyarrow schema manually (instead of 
letting pyarrow infer that from the dataframe).

For the example above, that would look like:

{code}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, 
schema=pa.schema([("s", pa.large_string())]))
{code}


[~apacman] does that help as a work-around?


was (Author: jorisvandenbossche):
bq. Note that you may be able to do the conversion manually and force an Arrow 
large_string type, though I'm not sure Pandas allows that. 

Yes, pandas allows that by specifying a pyarrow schema manually (instead of 
letting pyarrow infer that from the dataframe).

For the example above, that would look like:

{code}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, 
schema=pa.schema([("s", pa.large_string())]))
{code}

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64-encoded string of length 24. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11596) [Python][Dataset] SIGSEGV when executing scan tasks with Python executors

2021-02-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-11596:
-
Summary: [Python][Dataset] SIGSEGV when executing scan tasks with Python 
executors  (was: [C++][Python][Dataset] SIGSEGV when executing scan tasks with 
Python executors)

> [Python][Dataset] SIGSEGV when executing scan tasks with Python executors
> -
>
> Key: ARROW-11596
> URL: https://issues.apache.org/jira/browse/ARROW-11596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: dataset, datasets
>
> This crashes for me with a segfault:
> {code:python}
> import concurrent.futures
> import queue
> import numpy as np
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs as fs
> import pyarrow.parquet as pq
> schema = pa.schema([("foo", pa.float64())])
> table = pa.table([np.random.uniform(size=1024)], schema=schema)
> path = "/tmp/foo.parquet"
> pq.write_table(table, path)
> dataset = pa.dataset.FileSystemDataset.from_paths(
> [path],
> schema=schema,
> format=ds.ParquetFileFormat(),
> filesystem=fs.LocalFileSystem(),
> )
> with concurrent.futures.ThreadPoolExecutor(2) as executor:
> tasks = dataset.scan()
> q = queue.Queue()
> def _prebuffer():
> for task in tasks:
> iterator = task.execute()
> next(iterator)
> q.put(iterator)
> executor.submit(_prebuffer).result()
> next(q.get())
> {code}
> {noformat}
> $ uname -a
> Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + 
> x86_64 GNU/Linux
> $ pip freeze
> numpy==1.20.1
> pyarrow==3.0.0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11596) [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors

2021-02-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-11596:
-
Component/s: (was: C++)

> [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors
> --
>
> Key: ARROW-11596
> URL: https://issues.apache.org/jira/browse/ARROW-11596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: dataset, datasets
>
> This crashes for me with a segfault:
> {code:python}
> import concurrent.futures
> import queue
> import numpy as np
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs as fs
> import pyarrow.parquet as pq
> schema = pa.schema([("foo", pa.float64())])
> table = pa.table([np.random.uniform(size=1024)], schema=schema)
> path = "/tmp/foo.parquet"
> pq.write_table(table, path)
> dataset = pa.dataset.FileSystemDataset.from_paths(
> [path],
> schema=schema,
> format=ds.ParquetFileFormat(),
> filesystem=fs.LocalFileSystem(),
> )
> with concurrent.futures.ThreadPoolExecutor(2) as executor:
> tasks = dataset.scan()
> q = queue.Queue()
> def _prebuffer():
> for task in tasks:
> iterator = task.execute()
> next(iterator)
> q.put(iterator)
> executor.submit(_prebuffer).result()
> next(q.get())
> {code}
> {noformat}
> $ uname -a
> Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + 
> x86_64 GNU/Linux
> $ pip freeze
> numpy==1.20.1
> pyarrow==3.0.0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11596) [Python][Dataset] SIGSEGV when executing scan tasks with Python executors

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11596:
---
Labels: dataset datasets pull-request-available  (was: dataset datasets)

> [Python][Dataset] SIGSEGV when executing scan tasks with Python executors
> -
>
> Key: ARROW-11596
> URL: https://issues.apache.org/jira/browse/ARROW-11596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: dataset, datasets, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This crashes for me with a segfault:
> {code:python}
> import concurrent.futures
> import queue
> import numpy as np
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs as fs
> import pyarrow.parquet as pq
> schema = pa.schema([("foo", pa.float64())])
> table = pa.table([np.random.uniform(size=1024)], schema=schema)
> path = "/tmp/foo.parquet"
> pq.write_table(table, path)
> dataset = pa.dataset.FileSystemDataset.from_paths(
> [path],
> schema=schema,
> format=ds.ParquetFileFormat(),
> filesystem=fs.LocalFileSystem(),
> )
> with concurrent.futures.ThreadPoolExecutor(2) as executor:
> tasks = dataset.scan()
> q = queue.Queue()
> def _prebuffer():
> for task in tasks:
> iterator = task.execute()
> next(iterator)
> q.put(iterator)
> executor.submit(_prebuffer).result()
> next(q.get())
> {code}
> {noformat}
> $ uname -a
> Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + 
> x86_64 GNU/Linux
> $ pip freeze
> numpy==1.20.1
> pyarrow==3.0.0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics

2021-02-12 Thread Marc Prud'hommeaux (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283728#comment-17283728
 ] 

Marc Prud'hommeaux commented on ARROW-11586:


Unless there is some specific reason to panic there, replacing the `.unwrap()` 
with `?` fixes the issue: https://github.com/apache/arrow/pull/9479/files. I 
wonder if the other `unwrap()` instances in that module could similarly be 
turned into Result?

> [Rust] [Datafusion] Invalid SQL sometimes panics
> 
>
> Key: ARROW-11586
> URL: https://issues.apache.org/jira/browse/ARROW-11586
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Executing the invalid SQL "select 1 order by x" will panic rather than returning 
> an Err:
>  ```
> thread '' panicked at 'called `Result::unwrap()` on an `Err` value: 
> Plan("Invalid identifier \'x\' for schema Int64(1)")', 
> /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76
> stack backtrace:
>0: _rust_begin_unwind
>1: core::panicking::panic_fmt
>2: core::option::expect_none_failed
>3: core::result::Result::unwrap
>4: datafusion::sql::planner::SqlToRel::order_by::{{closure}}
>5: core::iter::adapters::map_try_fold::{{closure}}
>6: core::iter::traits::iterator::Iterator::try_fold
>7:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>8:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>9: core::iter::traits::iterator::Iterator::find
>   10:  as 
> core::iter::traits::iterator::Iterator>::next
>   11:  as alloc::vec::SpecFromIterNested>::from_iter
>   12:  as alloc::vec::SpecFromIter>::from_iter
>   13:  as 
> core::iter::traits::collect::FromIterator>::from_iter
>   14: core::iter::traits::iterator::Iterator::collect
>   15:  as 
> core::iter::traits::collect::FromIterator>>::from_iter::{{closure}}
>   16: core::iter::adapters::process_results
>   17:  as 
> core::iter::traits::collect::FromIterator>>::from_iter
>   18: core::iter::traits::iterator::Iterator::collect
>   19: datafusion::sql::planner::SqlToRel::order_by
>   20: datafusion::sql::planner::SqlToRel::query_to_plan
>   21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan
>   22: datafusion::sql::planner::SqlToRel::statement_to_plan
>   23: datafusion::execution::context::ExecutionContext::create_logical_plan
> ```
> This is happening because of an `unwrap` at 
> https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652.
>  
> Perhaps the error should be returned as the Result rather than panicking, so 
> the error can be handled? There are a number of other places in the planner 
> where `unwrap()` is used, so they may warrant similar treatment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-12 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283731#comment-17283731
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~apitrou] [~emkornfield] I think we can make a final decision on this; I'm OK 
with the option that gives end users some level of control to preserve the 
behaviour.

Please let me know your thoughts, thanks :)

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet 
> specification
> ---
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Truc Lam Nguyen
>Priority: Major
> Attachments: parquet-tools-meta.log
>
>
> Apologies if this behaviour is deliberate, but it looks like the parquet 
> writer for the list data type does not conform to the Apache Parquet list 
> logical type specification.
> According to this page: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists,] 
> list type contains 3 level where the middle level, named {{list}}, must be a 
> repeated group with a single field named _{{element}}_
> However, in the parquet file from pyarrow writer, that single field is named 
> _item_ instead,
> Please find below the example Python code that produces a parquet file (I use 
> pandas version 1.2.1 and pyarrow version 3.0.0) 
> {code:java}
> import pandas as pd
>  
> df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 
> 'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 
> 'games': [{'name': 'fifa', 'version': '21'}]}, ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> Then I use parquet-tools from 
> [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of 
> parquet file via this command
> parquet-tools meta /tmp/test.parquet
> The full metadata is attached; below is only an extract for the list-type 
> column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under {{list}} there is a single field named _item_.
> I think this should be renamed to _element_ to conform with the Apache 
> Parquet specification.
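
For reference, the same structure can also be inspected from Python without parquet-tools. A minimal sketch reusing the dataframe from the description above (not part of the original report):

{code:python}
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
                                     {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')

# ParquetFile.schema shows the Parquet-level schema, including the name of the
# repeated group's single field ("item" as written today, "element" per the spec).
print(pq.ParquetFile('/tmp/test.parquet').schema)
{code}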



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283766#comment-17283766
 ] 

Andy Grove commented on ARROW-11606:


I understand the issue better now.

In the DataFusion planner, the aggregate expressions are compiled against the 
schema of the input to the partial aggregate. These compiled expressions are 
then used to construct both the partial and final aggregates.

In other words, the expressions for the Final aggregate are not compiled 
against its input schema, but against the input schema of the Partial 
aggregate.

This feels a little unnatural when implementing serde but I will think about 
this more and see how I can work around this.

 

 

> [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
> -
>
> Key: ARROW-11606
> URL: https://issues.apache.org/jira/browse/ARROW-11606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> We have run into an issue in the Ballista project where we are reconstructing 
> the Final and Partial HashAggregateExec operators [1] for distributed 
> execution and we need some guidance.
> The Partial HashAggregateExec gets created OK and executes correctly.
> However, when we create the Final HashAggregateExec, it is not finding the 
> expected schema in the input operator. The partial exec outputs field names 
> ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
> seem to be looking for those names.
> It is also worth noting that the Final and Partial executors are not 
> connected directly in this usage.
> The Partial exec is executed and output streamed to disk.
> The Final exec then runs against the output from the Partial exec.
> We may need to make changes in DataFusion to allow other crates to support 
> this kind of use case?
>  [1] https://github.com/ballista-compute/ballista/pull/491
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions

2021-02-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-11601:
-
Description: This can help performance on high-latency filesystems.  (was: 
This can help performance on high-latency filesystems. However, some care will 
be needed as then we won't be able to create one Arrow reader per Parquet row 
group anymore.)

> [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
> -
>
> Key: ARROW-11601
> URL: https://issues.apache.org/jira/browse/ARROW-11601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: dataset, datasets
>
> This can help performance on high-latency filesystems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11606:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
> -
>
> Key: ARROW-11606
> URL: https://issues.apache.org/jira/browse/ARROW-11606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have run into an issue in the Ballista project where we are reconstructing 
> the Final and Partial HashAggregateExec operators [1] for distributed 
> execution and we need some guidance.
> The Partial HashAggregateExec gets created OK and executes correctly.
> However, when we create the Final HashAggregateExec, it is not finding the 
> expected schema in the input operator. The partial exec outputs field names 
> ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
> seem to be looking for those names.
> It is also worth noting that the Final and Partial executors are not 
> connected directly in this usage.
> The Partial exec is executed and output streamed to disk.
> The Final exec then runs against the output from the Partial exec.
> We may need to make changes in DataFusion to allow other crates to support 
> this kind of use case?
>  [1] https://github.com/ballista-compute/ballista/pull/491
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11601:
---
Labels: dataset datasets pull-request-available  (was: dataset datasets)

> [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
> -
>
> Key: ARROW-11601
> URL: https://issues.apache.org/jira/browse/ARROW-11601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: dataset, datasets, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This can help performance on high-latency filesystems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11607) [Python] Error when reading table with list values from parquet

2021-02-12 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11607:
--
Fix Version/s: 4.0.0

> [Python] Error when reading table with list values from parquet
> ---
>
> Key: ARROW-11607
> URL: https://issues.apache.org/jira/browse/ARROW-11607
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Environment: Python 3.7
>Reporter: Michal Glaus
>Priority: Major
> Fix For: 4.0.0
>
>
> I'm getting unexpected results when reading tables containing list values and 
> a large number of rows from a parquet file.
> Example code (pyarrow 2.0.0 and 3.0.0):
> {code:java}
> from pyarrow import parquet, Table
> data = [None] * (1 << 20)
> data.append([1])
> table = Table.from_arrays([data], ['column'])
> print('Expected: %s' % table['column'][-1])
> parquet.write_table(table, 'table.parquet')
> table2 = parquet.read_table('table.parquet')
> print('Actual:   %s' % table2['column'][-1]){code}
> Output:
> {noformat}
> Expected: [1]
> Actual:   [0]{noformat}
> When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
> {noformat}
> Expected: [1]
> Actual:   [1]{noformat}
> For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
> It seems that this is caused by some overflow and memory corruption because 
> in pyarrow 3.0.0 with more complex values (list of dictionaries with float 
> and datetime):
> {noformat}
> data.append([{'a': 0.1, 'b': datetime.now()}])
> {noformat}
> I'm getting this exception after calling table2.to_pandas():
> {noformat}
> /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create 
> default memory pool{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage

2021-02-12 Thread David Li (Jira)
David Li created ARROW-11609:


 Summary: [C++][Docs] Trivial CMake dependency on Arrow fails at 
link stage
 Key: ARROW-11609
 URL: https://issues.apache.org/jira/browse/ARROW-11609
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Affects Versions: 3.0.0
Reporter: David Li


The example in the docs here isn't sufficient: 
[https://arrow.apache.org/docs/cpp/cmake.html] 

It fails at link time because Arrow's transitive dependencies aren't included 
in the INTERFACE_LINK_LIBRARIES:
{noformat}
/usr/bin/ld: warning: libglog.so.0, needed by 
/home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using 
-rpath or -rpath-link)
/usr/bin/ld: warning: libutf8proc.so.2, needed by 
/home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using 
-rpath or -rpath-link)
/usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by 
/home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try using 
-rpath or -rpath-link)
# ...{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage

2021-02-12 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-11609.

Resolution: Not A Problem

> [C++][Docs] Trivial CMake dependency on Arrow fails at link stage
> -
>
> Key: ARROW-11609
> URL: https://issues.apache.org/jira/browse/ARROW-11609
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Affects Versions: 3.0.0
>Reporter: David Li
>Priority: Major
>
> The example in the docs here isn't sufficient: 
> [https://arrow.apache.org/docs/cpp/cmake.html] 
> It fails at link time because Arrow's transitive dependencies aren't included 
> in the INTERFACE_LINK_LIBRARIES:
> {noformat}
> /usr/bin/ld: warning: libglog.so.0, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> /usr/bin/ld: warning: libutf8proc.so.2, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> /usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> # ...{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11609) [C++][Docs] Trivial CMake dependency on Arrow fails at link stage

2021-02-12 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283901#comment-17283901
 ] 

David Li commented on ARROW-11609:
--

Ah, the actual issue here is needing rpath to contain the right directory. 
Including libutf8proc implicitly does that, but it seems ARROW-4065 
intentionally removed the transitive dependencies from ArrowTargets.cmake. 
Instead downstream projects depending on Arrow can use 
{{target_link_directories(..., path/to/conda/env/lib)}} (it seems this is 
really only an issue when using Conda). Closing.

> [C++][Docs] Trivial CMake dependency on Arrow fails at link stage
> -
>
> Key: ARROW-11609
> URL: https://issues.apache.org/jira/browse/ARROW-11609
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Affects Versions: 3.0.0
>Reporter: David Li
>Priority: Major
>
> The example in the docs here isn't sufficient: 
> [https://arrow.apache.org/docs/cpp/cmake.html] 
> It fails at link time because Arrow's transitive dependencies aren't included 
> in the INTERFACE_LINK_LIBRARIES:
> {noformat}
> /usr/bin/ld: warning: libglog.so.0, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> /usr/bin/ld: warning: libutf8proc.so.2, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> /usr/bin/ld: warning: libaws-cpp-sdk-config.so, needed by 
> /home/lidavidm/Code/Ursa/install/lib/libarrow.so.400.0.0, not found (try 
> using -rpath or -rpath-link)
> # ...{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11610) [C++] Download boost from sourceforge instead of bintray

2021-02-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11610:
---

 Summary: [C++] Download boost from sourceforge instead of bintray
 Key: ARROW-11610
 URL: https://issues.apache.org/jira/browse/ARROW-11610
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 4.0.0


e.g. 
https://sourceforge.net/projects/boost/files/boost/1.67.0/boost_1_67_0.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11611) [C++] Move third party dependency mirrors from bintray

2021-02-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11611:
---

 Summary: [C++] Move third party dependency mirrors from bintray
 Key: ARROW-11611
 URL: https://issues.apache.org/jira/browse/ARROW-11611
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Neal Richardson
 Fix For: 4.0.0


We added copies of these a while back to handle rate limiting to our own 
bintray. We should either remove them or update and move them elsewhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11612) [C++] Rebuild trimmed boost bundle

2021-02-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11612:
---

 Summary: [C++] Rebuild trimmed boost bundle
 Key: ARROW-11612
 URL: https://issues.apache.org/jira/browse/ARROW-11612
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Neal Richardson
 Fix For: 4.0.0


And host somewhere other than bintray. We can prune it further now that we've 
dropped boost::regex, too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11613) [R] Move nightly C++ builds off of bintray

2021-02-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11613:
---

 Summary: [R] Move nightly C++ builds off of bintray
 Key: ARROW-11613
 URL: https://issues.apache.org/jira/browse/ARROW-11613
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11499) [Packaging] Remove all use of bintray

2021-02-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11499:

Description: 
Bintray is being shut down on May 1. 

https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/

I've made subtasks for the bintray usage other than the 
dl.bintray.com/apache/arrow repository we use for hosting release artifacts.

  was:
Bintray is being shut down on May 1, and possibly as early as February 28 we 
won't be able to upload to it. 

https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/

Feel free to make subtasks to break out this work.


> [Packaging] Remove all use of bintray
> -
>
> Key: ARROW-11499
> URL: https://issues.apache.org/jira/browse/ARROW-11499
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 4.0.0
>
>
> Bintray is being shut down on May 1. 
> https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/
> I've made subtasks for the bintray usage other than the 
> dl.bintray.com/apache/arrow repository we use for hosting release artifacts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11610) [C++] Download boost from sourceforge instead of bintray

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11610:
---
Labels: pull-request-available  (was: )

> [C++] Download boost from sourceforge instead of bintray
> 
>
> Key: ARROW-11610
> URL: https://issues.apache.org/jira/browse/ARROW-11610
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> e.g. 
> https://sourceforge.net/projects/boost/files/boost/1.67.0/boost_1_67_0.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors

2021-02-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11611:

Summary: [C++] Update third party dependency mirrors  (was: [C++] Move 
third party dependency mirrors from bintray)

> [C++] Update third party dependency mirrors
> ---
>
> Key: ARROW-11611
> URL: https://issues.apache.org/jira/browse/ARROW-11611
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 4.0.0
>
>
> We added copies of these a while back to handle rate limiting to our own 
> bintray. We should either remove them or update and move them elsewhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11611) [C++] Update third party dependency mirrors

2021-02-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-11611:
---

Assignee: Ben Kietzman

> [C++] Update third party dependency mirrors
> ---
>
> Key: ARROW-11611
> URL: https://issues.apache.org/jira/browse/ARROW-11611
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> We added copies of these a while back as GitHub releases to handle rate 
> limiting to our own bintray. We've since bumped our dependency versions but 
> didn't update our copies in these mirrors, so they're currently useless.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors

2021-02-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11611:

Description: We added copies of these a while back as GitHub releases to 
handle rate limiting to our own bintray. We've since bumped our dependency 
versions but didn't update our copies in these mirrors, so they're currently 
useless.  (was: We added copies of these a while back to handle rate limiting 
to our own bintray. We should either remove them or update and move them 
elsewhere.)

> [C++] Update third party dependency mirrors
> ---
>
> Key: ARROW-11611
> URL: https://issues.apache.org/jira/browse/ARROW-11611
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 4.0.0
>
>
> We added copies of these a while back as GitHub releases to handle rate 
> limiting to our own bintray. We've since bumped our dependency versions but 
> didn't update our copies in these mirrors, so they're currently useless.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11611) [C++] Update third party dependency mirrors

2021-02-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11611:

Parent: (was: ARROW-11499)
Issue Type: Improvement  (was: Sub-task)

> [C++] Update third party dependency mirrors
> ---
>
> Key: ARROW-11611
> URL: https://issues.apache.org/jira/browse/ARROW-11611
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 4.0.0
>
>
> We added copies of these a while back as GitHub releases to handle rate 
> limiting to our own bintray. We've since bumped our dependency versions but 
> didn't update our copies in these mirrors, so they're currently useless.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11593) Parquet does not support wasm32-unknown-unknown target

2021-02-12 Thread Dominik Moritz (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284005#comment-17284005
 ] 

Dominik Moritz commented on ARROW-11593:


If lz4 is the issue, maybe we could switch to 
https://github.com/PSeitz/lz4_flex, which compiles to WASM. 

> Parquet does not support wasm32-unknown-unknown target
> --
>
> Key: ARROW-11593
> URL: https://issues.apache.org/jira/browse/ARROW-11593
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Dominik Moritz
>Priority: Major
>
> The Arrow crate successfully compiles to WebAssembly (e.g. 
> https://github.com/domoritz/arrow-wasm) but the Parquet crate currently does 
> not support the `wasm32-unknown-unknown` target. 
> Try out the repository at 
> https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0.
>  The problem seems to be in liblz4, even if I do not include lz4 in the 
> feature flags.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11614) [C++][Gandiva] Fix round() logic to return positive zero when argument is zero

2021-02-12 Thread Sagnik Chakraborty (Jira)
Sagnik Chakraborty created ARROW-11614:
--

 Summary: [C++][Gandiva] Fix round() logic to return positive zero 
when argument is zero
 Key: ARROW-11614
 URL: https://issues.apache.org/jira/browse/ARROW-11614
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Sagnik Chakraborty


Previously, round(0.0) and round(0.0, out_scale) were returning -0.0; with this 
patch, round() returns +0.0.
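
Not part of the report, but for context: a negative zero compares equal to positive zero, so it is easy to miss in tests. A quick Python check that distinguishes the two (analogous to what the Gandiva C++ fix targets):

{code:python}
import math

x = -0.0
print(x == 0.0)               # True: -0.0 and +0.0 compare equal
print(math.copysign(1.0, x))  # -1.0: the sign bit is set, so repr(x) is '-0.0'
{code}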



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11614) [C++][Gandiva] Fix round() logic to return positive zero when argument is zero

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11614:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Fix round() logic to return positive zero when argument is zero
> --
>
> Key: ARROW-11614
> URL: https://issues.apache.org/jira/browse/ARROW-11614
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Previously, round(0.0) and round(0.0, out_scale) were returning -0.0; with 
> this patch, round() returns +0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-11606:
---

Assignee: Andy Grove

> [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
> -
>
> Key: ARROW-11606
> URL: https://issues.apache.org/jira/browse/ARROW-11606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We have run into an issue in the Ballista project where we are reconstructing 
> the Final and Partial HashAggregateExec operators [1] for distributed 
> execution and we need some guidance.
> The Partial HashAggregateExec gets created OK and executes correctly.
> However, when we create the Final HashAggregateExec, it is not finding the 
> expected schema in the input operator. The partial exec outputs field names 
> ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
> seem to be looking for those names.
> It is also worth noting that the Final and Partial executors are not 
> connected directly in this usage.
> The Partial exec is executed and output streamed to disk.
> The Final exec then runs against the output from the Partial exec.
> We may need to make changes in DataFusion to allow other crates to support 
> this kind of use case?
>  [1] https://github.com/ballista-compute/ballista/pull/491
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11606.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9481
[https://github.com/apache/arrow/pull/9481]

> [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
> -
>
> Key: ARROW-11606
> URL: https://issues.apache.org/jira/browse/ARROW-11606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have run into an issue in the Ballista project where we are reconstructing 
> the Final and Partial HashAggregateExec operators [1] for distributed 
> execution and we need some guidance.
> The Partial HashAggregateExec gets created OK and executes correctly.
> However, when we create the Final HashAggregateExec, it is not finding the 
> expected schema in the input operator. The partial exec outputs field names 
> ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
> seem to be looking for those names.
> It is also worth noting that the Final and Partial executors are not 
> connected directly in this usage.
> The Partial exec is executed and output streamed to disk.
> The Final exec then runs against the output from the Partial exec.
> We may need to make changes in DataFusion to allow other crates to support 
> this kind of use case?
>  [1] https://github.com/ballista-compute/ballista/pull/491
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11586.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9479
[https://github.com/apache/arrow/pull/9479]

> [Rust] [Datafusion] Invalid SQL sometimes panics
> 
>
> Key: ARROW-11586
> URL: https://issues.apache.org/jira/browse/ARROW-11586
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Executing the invalid SQL "select 1 order by x" will panic rather than
> returning an Err:
>  ```
> thread '' panicked at 'called `Result::unwrap()` on an `Err` value: 
> Plan("Invalid identifier \'x\' for schema Int64(1)")', 
> /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76
> stack backtrace:
>0: _rust_begin_unwind
>1: core::panicking::panic_fmt
>2: core::option::expect_none_failed
>3: core::result::Result::unwrap
>4: datafusion::sql::planner::SqlToRel::order_by::{{closure}}
>5: core::iter::adapters::map_try_fold::{{closure}}
>6: core::iter::traits::iterator::Iterator::try_fold
>7:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>8:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>9: core::iter::traits::iterator::Iterator::find
>   10:  as 
> core::iter::traits::iterator::Iterator>::next
>   11:  as alloc::vec::SpecFromIterNested>::from_iter
>   12:  as alloc::vec::SpecFromIter>::from_iter
>   13:  as 
> core::iter::traits::collect::FromIterator>::from_iter
>   14: core::iter::traits::iterator::Iterator::collect
>   15:  as 
> core::iter::traits::collect::FromIterator>>::from_iter::{{closure}}
>   16: core::iter::adapters::process_results
>   17:  as 
> core::iter::traits::collect::FromIterator>>::from_iter
>   18: core::iter::traits::iterator::Iterator::collect
>   19: datafusion::sql::planner::SqlToRel::order_by
>   20: datafusion::sql::planner::SqlToRel::query_to_plan
>   21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan
>   22: datafusion::sql::planner::SqlToRel::statement_to_plan
>   23: datafusion::execution::context::ExecutionContext::create_logical_plan
> ```
> This is happening because of an `unwrap` at 
> https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652.
>  
> Perhaps the error should be returned as the Result rather than panicking, so 
> the error can be handled? There are a number of other places in the planner 
> where `unwrap()` is used, so they may warrant similar treatment.
>  
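
As an illustration of the suggested fix (a simplified, hypothetical stand-in for the identifier lookup, not the actual planner code), the lookup can return a Result so the caller propagates a plan error instead of unwrapping:

{code:rust}
// Hypothetical, simplified version of resolving an ORDER BY identifier
// against the schema's field names.
fn resolve_identifier(field_names: &[&str], name: &str) -> Result<usize, String> {
    field_names
        .iter()
        .position(|field| *field == name)
        // Propagate an error the caller can handle, rather than unwrap().
        .ok_or_else(|| format!("Invalid identifier '{}' for schema {:?}", name, field_names))
}

fn main() {
    // "select 1 order by x": there is no column `x`, so this is Err, not a panic.
    match resolve_identifier(&["Int64(1)"], "x") {
        Ok(index) => println!("found at index {}", index),
        Err(e) => println!("plan error: {}", e),
    }
}
{code}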



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics

2021-02-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-11586:

Component/s: Rust - DataFusion

> [Rust] [Datafusion] Invalid SQL sometimes panics
> 
>
> Key: ARROW-11586
> URL: https://issues.apache.org/jira/browse/ARROW-11586
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Executing the invalid SQL "select 1 order by x" will panic rather than
> returning an Err:
>  ```
> thread '' panicked at 'called `Result::unwrap()` on an `Err` value: 
> Plan("Invalid identifier \'x\' for schema Int64(1)")', 
> /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76
> stack backtrace:
>0: _rust_begin_unwind
>1: core::panicking::panic_fmt
>2: core::option::expect_none_failed
>3: core::result::Result::unwrap
>4: datafusion::sql::planner::SqlToRel::order_by::{{closure}}
>5: core::iter::adapters::map_try_fold::{{closure}}
>6: core::iter::traits::iterator::Iterator::try_fold
>7:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>8:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>9: core::iter::traits::iterator::Iterator::find
>   10:  as 
> core::iter::traits::iterator::Iterator>::next
>   11:  as alloc::vec::SpecFromIterNested>::from_iter
>   12:  as alloc::vec::SpecFromIter>::from_iter
>   13:  as 
> core::iter::traits::collect::FromIterator>::from_iter
>   14: core::iter::traits::iterator::Iterator::collect
>   15:  as 
> core::iter::traits::collect::FromIterator>>::from_iter::{{closure}}
>   16: core::iter::adapters::process_results
>   17:  as 
> core::iter::traits::collect::FromIterator>>::from_iter
>   18: core::iter::traits::iterator::Iterator::collect
>   19: datafusion::sql::planner::SqlToRel::order_by
>   20: datafusion::sql::planner::SqlToRel::query_to_plan
>   21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan
>   22: datafusion::sql::planner::SqlToRel::statement_to_plan
>   23: datafusion::execution::context::ExecutionContext::create_logical_plan
> ```
> This is happening because of an `unwrap` at 
> https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652.
>  
> Perhaps the error should be returned as the Result rather than panicking, so 
> the error can be handled? There are a number of other places in the planner 
> where `unwrap()` is used, so they may warrant similar treatment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza commented on ARROW-6154:
---

I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.
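
For illustration, here is a self-contained sketch of the failure mode (not the parquet reader itself; the path and loop bound are arbitrary assumptions): every `File::try_clone` duplicates the underlying OS file descriptor, so holding one clone per column chunk of a very wide file exhausts the per-process limit:

{code:rust}
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Any readable file will do; the path is just an example.
    let file = File::open("/etc/hostname")?;
    let mut clones = Vec::new();
    // Each try_clone() duplicates the file descriptor. Keeping thousands of
    // clones alive at once (e.g. one per column chunk of a 3000-column file)
    // eventually fails with "Too many open files (os error 24)".
    for i in 0..100_000 {
        match file.try_clone() {
            Ok(clone) => clones.push(clone),
            Err(e) => {
                println!("try_clone failed after {} clones: {}", i, e);
                break;
            }
        }
    }
    Ok(())
}
{code}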

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used [rust]*parquet-read binary to read a deeply nested parquet file and see 
> the below stack trace. Unfortunately won't be able to upload file.*
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:02 PM:
--

I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`:
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/portfolio.parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used [rust]*parquet-read binary to read a deeply nested parquet file and see 
> the below stack trace. Unfortunately won't be able to upload file.*
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:03 PM:
--

I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`:
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`:
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/portfolio.parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used [rust]*parquet-read binary to read a deeply nested parquet file and see 
> the below stack trace. Unfortunately won't be able to upload file.*
> {code:java}
> stack backtrace

[jira] [Updated] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Riza updated ARROW-6154:
--
Attachment: 
part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
> Attachments: 
> part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet
>
>
> Used [rust]*parquet-read binary to read a deeply nested parquet file and see 
> the below stack trace. Unfortunately won't be able to upload file.*
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:06 PM:
--

I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`:
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
> Attachments: 
> part-9-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet
>
>
> Used [rust]*parquet-read binary

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:08 PM:
--

I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux).

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` call 
here eventually fails as there are too many file handles open.

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Ma

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:09 PM:
--

I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux, {color:#00}Fedora release 33 and rustc 1.50.0 
(cb75ad5db 2021-02-10)).{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux).

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}

> [Rust] [Parquet] Too many open files (os error 24)
> --
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/

[jira] [Commented] (ARROW-9392) [C++] Document more of the compute layer

2021-02-12 Thread Aldrin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284029#comment-17284029
 ] 

Aldrin commented on ARROW-9392:
---

Hello [~apitrou]! I am interested in helping out with this, at *least* with the 
portions that I will be using significantly in the near future.

 

I'm not sure there's much to do here, but I have just had trouble finding 
documentation myself and wanted to volunteer to contribute. (I posted to the 
mailing list in case a lot of this already exists: 
https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E)

 

For reference, my usage of the APIs will be related to ARROW-10549, but with a 
different end goal.

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 4.0.0
>
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9392) [C++] Document more of the compute layer

2021-02-12 Thread Aldrin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284029#comment-17284029
 ] 

Aldrin edited comment on ARROW-9392 at 2/12/21, 11:14 PM:
--

Hello [~apitrou]! I am interested in helping out with this, at *least* with the 
portions that I will be using significantly in the near future. I figured 
pinging you first to orient myself made sense since you created this issue.

 

I'm not sure how much there is to do here, but I have just had trouble finding 
documentation myself and wanted to volunteer to contribute. (I posted to the 
mailing list in case a lot of this already exists: 
[https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E])

 

For reference, my usage of the APIs will be related to ARROW-10549, but with a 
different end goal.

 

Thanks!


was (Author: octalene):
Hello [~apitrou]! I am interested in helping out with this, at *least* with the 
portions that I will be using significantly in the near future.

 

I'm not sure there's much to do here, but I have just had trouble finding 
documentation myself and wanted to volunteer to contribute. (I posted to the 
mailing list in case a lot of this already exists: 
https://lists.apache.org/thread.html/rb0633480a9cf07d311d3a1143c2be1bce3a83e6ae5cf281ebb2cff9b%40%3Cdev.arrow.apache.org%3E)

 

For reference, my usage of the APIs will be related to ARROW-10549, but with a 
different end goal.

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 4.0.0
>
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way

2021-02-12 Thread Weiyang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao updated ARROW-11566:
-
Description: 
I created the pypi condition package to allow user friendly expression of 
conditions. For example, a condition can be written as:

(f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'] 

where A, B, C are partition keys.

For usage details, please see its document at: 

[https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#]

 

Arbitrary condition objects can be converted to pyarrow's filter by calling its 
to_pyarrow_filter() method:

[https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]

The above method will normalize the condition to conform to pyarrow filter 
specification.

 

Furthermore, the condition object can be used directly to evaluate partition 
paths. This can replace the current complex filtering code (both native and 
Python).

For max efficiency, filtering with the condition object can be done in the 
following ways:
 # read the paths in chunks to keep the memory footprint small;
 # parse the paths to be a pandas dataframe;
 # use condition.query(dataframe) to get the filtered dataframe of path.
 # use numexpr backend for dataframe query for efficiency.

Please discuss.

  was:
I created the pypi condition package to allow user friendly expression of 
conditions. For example, a condition can be:

(A <= 3 or B != 'b1') and C == ['c1', 'c2'] 

For usage details, please see its document at: 

[https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#]

 

Arbitrary condition objects can be converted to pyarrow's filter by calling its 
to_pyarrow_filter() method:

[https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]

The above method will normalize the condition to conform to pyarrow filter 
specification.

 

Furthermore, the condition object can be used directly to evaluate partition 
paths. This can replace the current complex filtering code (both native and 
Python).

For max efficiency, filtering with the condition object can be done in the 
following ways:
 # read the paths in chunks to keep the memory footprint small;
 # parse the paths to be a pandas dataframe;
 # use condition.query(dataframe) to get the filtered dataframe of path.
 # use numexpr backend for dataframe query for efficiency.

Please discuss.


> [Python][Parquet] Use pypi condition package to filter partitions in a user 
> friendly way
> 
>
> Key: ARROW-11566
> URL: https://issues.apache.org/jira/browse/ARROW-11566
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weiyang Zhao
>Assignee: Weiyang Zhao
>Priority: Major
>
> I created the pypi condition package to allow user friendly expression of 
> conditions. For example, a condition can be written as:
> (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'] 
> where A, B, C are partition keys.
> For usage details, please see its document at: 
> [https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#]
>  
> Arbitrary condition objects can be converted to pyarrow's filter by calling 
> its
> to_pyarrow_filter() method:
> [https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]
> The above method will normalize the condition to conform to pyarrow filter 
> specification.
>  
> Furthermore, the condition object can be used directly to evaluate partition 
> paths. This can replace the current complex filtering code (both native and 
> Python).
> For max efficiency, filtering with the condition object can be done in the 
> following ways:
>  # read the paths in chunks to keep the memory footprint small;
>  # parse the paths to be a pandas dataframe;
>  # use condition.query(dataframe) to get the filtered dataframe of path.
>  # use numexpr backend for dataframe query for efficiency.
> Please discuss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:29 PM:
--

I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux, {color:#00}Fedora release 33 and rustc 1.50.0 
(cb75ad5db 2021-02-10)).{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux, {color:#00}Fedora release 33 and rustc 1.50.0 
(cb75ad5db 2021-02-10)).{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_file

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:31 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file 
descriptors{color:#00}.{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same error (potentially from a different location). In my 
case it appears to be due to the `try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file 
descriptors{color:#00}.{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:31 PM:
--

I've come across the same error (potentially from a different location). In my 
case it appears to be due to the `try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file 
descriptors{color:#00}.{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same issue. It appears to be due to the `try_clone` calls 
in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as it ends up creating too many open file descriptors (I'm 
running this on Linux, {color:#00}Fedora release 33 and rustc 1.50.0 
(cb75ad5db 2021-02-10)).{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.co

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:32 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as it ends up creating too many open 
file descriptors{color:#00}.{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really 
a solution, since the code path ends up just creating potentially a very large 
number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}
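
For completeness, a minimal reproduction sketch in Rust: the file path is the one 
shown in the stack trace above, and the row-iteration loop is an assumption about 
how the columns end up being read (any full read of such a wide file should hit the 
same path through `FileSource::new`).
{code:java}
use std::convert::TryFrom;

use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The attached 3000-column file from the report.
    let reader = SerializedFileReader::try_from(
        "resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet",
    )?;
    // Reading rows builds one column reader per column; each one goes through
    // FileSource::new, which try_clone()s the underlying file, so a very wide
    // file can exhaust the per-process fd limit ("os error 24").
    for row in reader.get_row_iter(None)? {
        let _ = row;
    }
    Ok(())
}
{code}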


was (Author: dr.r...@gmail.com):
I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
In my case I have a Parquet file with 3000 columns, and the `try_clone` calls 
here eventually fail as they end up creating too many open file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:46 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. The code in 
`io.rs` is also called for every column when the columns are subsequently read 
(see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_r

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:46 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. The code in 
`FileSource::new` is also called for every column when the columns are 
subsequently read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. The code in 
`io.rs` is also called for every column when the columns are subsequently read 
(see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader

[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:47 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. 
`FileSource::new` (in io.rs) is also called for every column when the columns 
are subsequently read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}


was (Author: dr.r...@gmail.com):
I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. The code in 
`FileSource::new` is also called for every column when the columns are 
subsequently read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serializ

[jira] [Created] (ARROW-11615) DataFusion does not support wasm32-unknown-unknown target

2021-02-12 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11615:
--

 Summary: DataFusion does not support wasm32-unknown-unknown target
 Key: ARROW-11615
 URL: https://issues.apache.org/jira/browse/ARROW-11615
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Dominik Moritz


The Arrow crate successfully compiles to WebAssembly (e.g. 
https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does 
not support the `wasm32-unknown-unknown` target.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6154) [Rust] [Parquet] Too many open files (os error 24)

2021-02-12 Thread Ahmed Riza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027
 ] 

Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:59 PM:
--

I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors (one for each column in the Parquet file).

This is the initial stack trace when the footer is first read. 
`FileSource::new` (in io.rs) is also called for every column when the columns 
are subsequently read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x55845c4a in parquet::file::serialized_reader::{{impl}}::try_from 
(path=0x7d20) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x55845d34 in parquet::file::serialized_reader::{{impl}}::try_from 
(path="resources/parquet/part-1-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x5577c7f5 in 
data_rust::parquet::parquet_demo::test::test_read_multiple_files () at 
/work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}
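
As a separate stand-alone illustration of the failure mode (this is not the parquet 
crate's code; it only mimics one `try_clone` per column of a 3000-column file, and 
the file path is arbitrary):
{code:java}
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Any readable file will do for the demonstration.
    let file = File::open("/etc/hosts")?;
    let mut clones = Vec::new();
    // One duplicated descriptor per "column".
    for i in 0..3000 {
        match file.try_clone() {
            Ok(dup) => clones.push(dup),
            Err(e) => {
                // With the default `ulimit -n` of 1024 this typically fails with
                // "Too many open files (os error 24)" well before 3000 clones.
                eprintln!("clone #{} failed: {}", i, e);
                break;
            }
        }
    }
    println!("held {} descriptors before failing", clones.len() + 1);
    Ok(())
}
{code}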


was (Author: dr.r...@gmail.com):
I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail as they end up creating too many open 
file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can 
be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but that is not 
really a solution, since the code path just ends up creating a potentially very 
large number of open file descriptors.

This is the initial stack trace when the footer is first read. 
`FileSource::new` (in io.rs) is also called for every column when the columns 
are subsequently read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new 
(fd=0x77c3fafc, start=807191, length=65536) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x558294ce in parquet::file::serialized_reader::{{impl}}::get_read 
(self=0x77c3fafc, start=807191, length=65536)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x5590a3fc in parquet::file::footer::parse_metadata 
(chunk_reader=0x77c3fafc) at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x55845db1 in 
parquet::file::serialized_reader::SerializedFileReader::new
 (chunk_reader=...)

    at 
/home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x55845bb6 in parquet::file::serialized_reader::{{impl}}::try_from 
(file=...) at 
/home/a/.cargo/registry/src/github.com-1ecc629

[jira] [Updated] (ARROW-11615) DataFusion does not support wasm32-unknown-unknown target

2021-02-12 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-11615:
---
Description: 
The Arrow crate successfully compiles to WebAssembly (e.g. 
https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does 
not support the `wasm32-unknown-unknown` target.

Try out the repository at 
https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a.
 


{code}
error[E0433]: failed to resolve: could not find `unix` in `os`
  --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
   |
41 | use std::os::unix::ffi::OsStringExt;
   |   could not find `unix` in `os`

error[E0432]: unresolved import `unix`
 --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
  |
6 | use unix;
  |  no `unix` in the root

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
  --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9
   |
98 | sys::duplicate(self)
   | ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9
|
101 | sys::allocated_size(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9
|
104 | sys::allocate(self, len)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9
|
107 | sys::lock_shared(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9
|
110 | sys::lock_exclusive(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9
|
113 | sys::try_lock_shared(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9
|
116 | sys::try_lock_exclusive(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9
|
119 | sys::unlock(self)
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5
|
126 | sys::lock_error()
| ^^^ use of undeclared crate or module `sys`

error[E0433]: failed to resolve: use of undeclared crate or module `sys`
   --> 
/Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5
|
169 | sys::statvfs(path.as_ref())
| ^^^ use of undeclared crate or module `sys`

   Compiling num-rational v0.3.2
error: aborting due to 10 previous errors
{code}


  was:The Arrow crate successfully compiles to WebAssembly (e.g. 
https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does 
not support the `wasm32-unknown-unknown` target.


> DataFusion does not support wasm32-unknown-unknown target
> -
>
> Key: ARROW-11615
> URL: https://issues.apache.org/jira/browse/ARROW-11615
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Dominik Moritz
>Priority: Major
>
> The Arrow crate successfully compiles to WebAssembly (e.g. 
> https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently 
> does not support the `wasm32-unknown-unknown` target.
> Try out the repository at 
> https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a.
>  
> {code}
> error[E0433]: failed to resolve: could not find `unix` in `os`
>   --> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
>|
> 41 | use std::os::unix::ffi::OsStringExt;
>|   could no

[jira] [Updated] (ARROW-11593) [Rust] Parquet does not support wasm32-unknown-unknown target

2021-02-12 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-11593:
---
Summary: [Rust] Parquet does not support wasm32-unknown-unknown target  
(was: Parquet does not support wasm32-unknown-unknown target)

> [Rust] Parquet does not support wasm32-unknown-unknown target
> -
>
> Key: ARROW-11593
> URL: https://issues.apache.org/jira/browse/ARROW-11593
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Dominik Moritz
>Priority: Major
>
> The Arrow crate successfully compiles to WebAssembly (e.g. 
> https://github.com/domoritz/arrow-wasm) but the Parquet crate currently does 
> not support the `wasm32-unknown-unknown` target. 
> Try out the repository at 
> https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0.
>  The problem seems to be in liblz4, even if I do not include lz4 in the 
> feature flags.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11615) [Rust] DataFusion does not support wasm32-unknown-unknown target

2021-02-12 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-11615:
---
Summary: [Rust] DataFusion does not support wasm32-unknown-unknown target  
(was: DataFusion does not support wasm32-unknown-unknown target)

> [Rust] DataFusion does not support wasm32-unknown-unknown target
> 
>
> Key: ARROW-11615
> URL: https://issues.apache.org/jira/browse/ARROW-11615
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Dominik Moritz
>Priority: Major
>
> The Arrow crate successfully compiles to WebAssembly (e.g. 
> https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently 
> does not support the `wasm32-unknown-unknown` target.
> Try out the repository at 
> https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a.
>  
> {code}
> error[E0433]: failed to resolve: could not find `unix` in `os`
>   --> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
>|
> 41 | use std::os::unix::ffi::OsStringExt;
>|   could not find `unix` in `os`
> error[E0432]: unresolved import `unix`
>  --> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
>   |
> 6 | use unix;
>   |  no `unix` in the root
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>   --> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9
>|
> 98 | sys::duplicate(self)
>| ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9
> |
> 101 | sys::allocated_size(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9
> |
> 104 | sys::allocate(self, len)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9
> |
> 107 | sys::lock_shared(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9
> |
> 110 | sys::lock_exclusive(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9
> |
> 113 | sys::try_lock_shared(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9
> |
> 116 | sys::try_lock_exclusive(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9
> |
> 119 | sys::unlock(self)
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5
> |
> 126 | sys::lock_error()
> | ^^^ use of undeclared crate or module `sys`
> error[E0433]: failed to resolve: use of undeclared crate or module `sys`
>--> 
> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5
> |
> 169 | sys::statvfs(path.as_ref())
> | ^^^ use of undeclared crate or module `sys`
>Compiling num-rational v0.3.2
> error: aborting due to 10 previous errors
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame

2021-02-12 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11616:
---

 Summary: [Rust][DataFusion] Expose collect_partitioned for 
DataFrame
 Key: ARROW-11616
 URL: https://issues.apache.org/jira/browse/ARROW-11616
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


The DataFrame API has a `collect` method which invokes the `collect(plan: 
Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects 
records into a single vector of RecordBatches, removing the partitioning 
via `MergeExec`.

The DataFrame should also expose the `collect_partitioned` method so that 
partitions can be maintained.

```
collect_partitioned(
    plan: Arc<dyn ExecutionPlan>,
) -> Result<Vec<Vec<RecordBatch>>>
```
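
For context, a hedged sketch of the two collection paths using the existing free 
functions; the `datafusion::physical_plan` module path and the async signatures are 
assumptions based on the DataFusion source around this release, and `plan` stands for 
whatever physical plan the DataFrame produced:

```
use std::sync::Arc;

use arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::physical_plan::{collect, collect_partitioned, ExecutionPlan};

// Gathers the same plan both ways, to show what the proposed DataFrame method
// would preserve compared to the existing `collect`.
async fn gather(plan: Arc<dyn ExecutionPlan>) -> Result<()> {
    // Existing path: all partitions merged (via MergeExec) into one flat Vec.
    let merged: Vec<RecordBatch> = collect(plan.clone()).await?;
    // Proposed path for DataFrame: one Vec<RecordBatch> per input partition.
    let partitioned: Vec<Vec<RecordBatch>> = collect_partitioned(plan).await?;
    println!("{} merged batches, {} partitions", merged.len(), partitioned.len());
    Ok(())
}
```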



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11616:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Expose collect_partitioned for DataFrame
> ---
>
> Key: ARROW-11616
> URL: https://issues.apache.org/jira/browse/ARROW-11616
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Mike Seddon
>Assignee: Mike Seddon
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DataFrame API has a `collect` method which invokes the `collect(plan: 
> Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects 
> records into a single vector of RecordBatches, removing the 
> partitioning via `MergeExec`.
> The DataFrame should also expose the `collect_partitioned` method so that 
> partitions can be maintained.
> ```
> collect_partitioned(
>     plan: Arc<dyn ExecutionPlan>,
> ) -> Result<Vec<Vec<RecordBatch>>>
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-12 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284083#comment-17284083
 ] 

Micah Kornfield commented on ARROW-11497:
-

My thought: I think in the short term we can expose the flag. We can figure out a 
longer-term plan for migrating all users to a conformant writer/reader.

 

[~trucnguyenlam] do you want to provide a PR?

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet 
> specification
> ---
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Truc Lam Nguyen
>Priority: Major
> Attachments: parquet-tools-meta.log
>
>
> Sorry if this behaviour is deliberate and I am simply not aware of it, but it 
> looks like the parquet writer for the list data type does not conform to the 
> Apache Parquet list logical type specification.
> According to this page: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
> the list type contains 3 levels, where the middle level, named {{list}}, must be a 
> repeated group with a single field named _{{element}}_.
> However, in the parquet file produced by the pyarrow writer, that single field is 
> named _item_ instead.
> Please find below the example python code that produces a parquet file (I use 
> pandas version 1.2.1 and pyarrow version 3.0.0).
> {code:java}
> import pandas as pd
>  
> df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 
> 'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 
> 'games': [{'name': 'fifa', 'version': '21'}]}, ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> Then I use parquet-tools from 
> [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the 
> parquet file via this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is attached; here is only an extract of the list-type 
> column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under {{list}} there is a single field named _item_.
> I think this should be named _element_ to conform with the Apache 
> Parquet specification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11617) [C++][Gandiva] Fix nested if-else optimisation in gandiva

2021-02-12 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-11617:
--

 Summary: [C++][Gandiva] Fix nested if-else optimisation in gandiva
 Key: ARROW-11617
 URL: https://issues.apache.org/jira/browse/ARROW-11617
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Projjal Chanda
Assignee: Projjal Chanda


In gandiva, when we have nested if-else statements we reuse the 
local bitmap and treat it as a single logical if - elseif - .. - else 
condition. However, when we have, say, another function between them, like
IF
THEN
ELSE
   function(
 IF
 THEN
 ELSE
  )

in such cases we currently do the same thing, which can 
lead to incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11617) [C++][Gandiva] Fix nested if-else optimisation in gandiva

2021-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11617:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Fix nested if-else optimisation in gandiva
> -
>
> Key: ARROW-11617
> URL: https://issues.apache.org/jira/browse/ARROW-11617
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In gandiva, when we have nested if-else statements we reuse 
> the local bitmap and treat it as a single logical if - elseif - .. - else 
> condition. However, when we have, say, another function between them, like
> IF
> THEN
> ELSE
>    function(
>  IF
>  THEN
>  ELSE
>   )
> in such cases we currently do the same thing, which can 
> lead to incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)