jpdonasolo opened a new issue, #45106:
URL: https://github.com/apache/arrow/issues/45106

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   If I download the file from S3 to my machine, I can read it using pandas:
   
   ```python
   >>> df = pd.read_parquet(my_file)
   >>> df.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 77255 entries, 0 to 77254
   Data columns (total 24 columns):
    #   Column                Non-Null Count  Dtype  
   ---  ------                --------------  -----  
    0   doc_id                77255 non-null  Int64  
    1   price_id              77255 non-null  Int64  
    2   ean                   77255 non-null  object 
    3   preco_liquido         77255 non-null  float64
    4   rede                  77255 non-null  object 
    5   cnpj                  77255 non-null  object 
    6   tipo                  77255 non-null  object 
    7   codestado             77255 non-null  Int64  
    8   codcidade             77255 non-null  Int64  
    9   descricao             77255 non-null  object 
    10  desconto              77205 non-null  float64
    11  endereco_logradouro   77255 non-null  object 
    12  endereco_numero       77255 non-null  object 
    13  endereco_complemento  77255 non-null  object 
    14  cep                   77255 non-null  object 
    15  bairro                77255 non-null  object 
    16  latitude              77255 non-null  float64
    17  longitude             77255 non-null  float64
    18  nome_fantasia         77255 non-null  object 
    19  telefone              77255 non-null  object 
    20  segmento              77255 non-null  object 
    21  fonte_coleta          77255 non-null  object 
    22  pmc                   0 non-null      float64
    23  composicao            77213 non-null  object 
   dtypes: Int64(4), float64(5), object(15)
   memory usage: 14.4+ MB
   ```
   
   However, if I try to read it directly from S3:
   ```python
   df = pd.read_parquet(f"s3://{bucket}/{my_file}", storage_options=dict(profile=aws_profile))
   ```
   
   I get the following type error:
   ```python
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
        return impl.read(
               ^^^^^^^^^^
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
        pa_table = self.api.parquet.read_table(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
        dataset = ParquetDataset(
                  ^^^^^^^^^^^^^^^
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
        self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
        return _filesystem_dataset(source, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
        return factory.finish(schema)
               ^^^^^^^^^^^^^^^^^^^^^^
      File "pyarrow/_dataset.pyx", line 3126, in pyarrow._dataset.DatasetFactory.finish
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    pyarrow.lib.ArrowTypeError: Unable to merge: Field tipo has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
   ```
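   The error suggests the files under the S3 prefix disagree on how `tipo` is encoded: plain string in some, dictionary-encoded (pandas categorical) string in others, so dataset discovery fails while unifying their schemas. The sketch below recreates the mismatch with local files standing in for the S3 objects (all file names and values here are hypothetical) and shows one possible workaround: read each file on its own and cast to a common schema before concatenating, which sidesteps the schema merge entirely.
   
   ```python
   import os
   import tempfile
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Recreate the mismatch: one file stores `tipo` as a plain string column,
   # the other as a dictionary-encoded (pandas categorical) string column.
   tmp = tempfile.mkdtemp()
   pq.write_table(
       pa.table({"tipo": pa.array(["generico", "marca"], type=pa.string())}),
       os.path.join(tmp, "part-0.parquet"),
   )
   pq.write_table(
       pa.table({"tipo": pa.array(["generico", "marca"]).dictionary_encode()}),
       os.path.join(tmp, "part-1.parquet"),
   )
   
   # Workaround sketch: read each file individually and cast to one
   # agreed-upon schema, instead of letting dataset discovery merge them.
   target = pa.schema([pa.field("tipo", pa.string())])
   tables = [
       pq.read_table(os.path.join(tmp, name)).cast(target)
       for name in sorted(os.listdir(tmp))
   ]
   combined = pa.concat_tables(tables)
   print(combined.schema)
   ```
   
   Passing an explicit `schema=` to `pyarrow.dataset.dataset(...)` may also work, but the per-file cast keeps the type reconciliation explicit.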
   
   ### Component(s)
   
   Python

