jpdonasolo opened a new issue, #45106:
URL: https://github.com/apache/arrow/issues/45106
### Describe the bug, including details regarding any error messages, version, and platform.
If I download the file from S3 to my machine, I can read it with pandas:
```python
>>> df = pd.read_parquet(my_file)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77255 entries, 0 to 77254
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   doc_id                77255 non-null  Int64
 1   price_id              77255 non-null  Int64
 2   ean                   77255 non-null  object
 3   preco_liquido         77255 non-null  float64
 4   rede                  77255 non-null  object
 5   cnpj                  77255 non-null  object
 6   tipo                  77255 non-null  object
 7   codestado             77255 non-null  Int64
 8   codcidade             77255 non-null  Int64
 9   descricao             77255 non-null  object
 10  desconto              77205 non-null  float64
 11  endereco_logradouro   77255 non-null  object
 12  endereco_numero       77255 non-null  object
 13  endereco_complemento  77255 non-null  object
 14  cep                   77255 non-null  object
 15  bairro                77255 non-null  object
 16  latitude              77255 non-null  float64
 17  longitude             77255 non-null  float64
 18  nome_fantasia         77255 non-null  object
 19  telefone              77255 non-null  object
 20  segmento              77255 non-null  object
 21  fonte_coleta          77255 non-null  object
 22  pmc                   0 non-null      float64
 23  composicao            77213 non-null  object
dtypes: Int64(4), float64(5), object(15)
memory usage: 14.4+ MB
```
However, if I try to read it directly from S3, I get a type error:
```python
df = pd.read_parquet(f"s3://{bucket}/{my_file}",
                     storage_options=dict(profile=aws_profile))
```
The full traceback is:
```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3126, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field tipo has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
```
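One thing that may be worth ruling out (an assumption, not something confirmed in this report): the error pairs a plain `string` field with a dictionary-encoded one, which is the kind of mismatch that could arise if the S3 key contains a Hive-style `tipo=<value>` segment and pyarrow infers a partition field named `tipo` in addition to the `tipo` column stored in the file. The sketch below only inspects a path for such segments. The helper names and the example path are hypothetical and not taken from the actual bucket.

```python
from urllib.parse import urlparse


def hive_keys_in_path(path: str) -> list[str]:
    """Return the keys of Hive-style `key=value` segments in a path or URI."""
    parsed = urlparse(path)
    # For s3://bucket/a/b the bucket lands in netloc and the key in path.
    segments = (parsed.netloc + parsed.path).split("/")
    keys = []
    for segment in segments:
        key, sep, _value = segment.partition("=")
        if sep and key:
            keys.append(key)
    return keys


def colliding_columns(path: str, columns: list[str]) -> list[str]:
    """Return the columns whose names also appear as partition keys in the path."""
    keys = set(hive_keys_in_path(path))
    return [column for column in columns if column in keys]


# Hypothetical path, for illustration only.
example_path = "s3://my-bucket/exports/tipo=A/part-0.parquet"
print(colliding_columns(example_path, ["doc_id", "tipo", "rede"]))
```

If a collision shows up for the real path, passing `partitioning=None` through `pd.read_parquet` (which forwards extra keyword arguments to `pyarrow.parquet.read_table`) might be worth trying to see whether the error goes away. This has not been verified against the file in this report.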
### Component(s)
Python