NarayanB commented on issue #12416: URL: https://github.com/apache/arrow/issues/12416#issuecomment-1042969321
```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def read_csv(csv_file):
    print("reading csv")
    csv = """pid1,pid2,intCol,strCol
2010,10000,23455555508999,Peaceful
2015,15000,7753285016841556620,Happy
2020,25000,,World""".encode()
    # Workaround: read intCol as Utf8, fill the empty field with "0",
    # then cast to Int64 so the column never contains a null
    good_df = (
        pl.read_csv(csv, dtypes={"intCol": pl.Utf8})
        .with_column(pl.col("intCol").str.replace("", "0").cast(pl.Int64))
    )
    # Plain read: the third row leaves a null in intCol
    bad_df = pl.read_csv(csv)
    df = good_df
    # df = bad_df
    print(df.head(10))
    table = df.to_arrow()
    print('Table Schema..\n', table.schema)
    return table

def save_table(table, location):
    pq.write_to_dataset(table, location, partition_cols=['pid1', 'pid2'])

def read_table(location):
    schema = pa.schema([
        ('pid1', pa.int64()),
        ('pid2', pa.int64())
    ])
    partition = ds.partitioning(schema=schema, flavor='hive')
    dataset = ds.dataset(location, partitioning=partition)
    table = dataset.to_table()
    print("Retrieved table schema\n", table)
    df = pl.from_arrow(table)
    print(df.head(10))

table = read_csv(None)
save_table(table, '../data')
read_table('../data')
```

Please run this program twice, once with `df = good_df` and then with `df = bad_df` (you will have to clear the partitioned data between runs). You will see the int64 value toggling between 7753285016841556620 and 7753285016841556992. I think the issue is that, for some very big int64 values, having a null in the column triggers the corruption; see the sketch below. There is no pandas involved here. I haven't seen the issue when the int64 field values are small, e.g. 10000 or 2500001.
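A minimal sketch of where the corrupted value could come from (my assumption, not verified against the Arrow write path): an int64 → float64 round trip. float64 has a 53-bit significand, so it cannot represent 7753285016841556620 exactly, and the nearest representable value is exactly the corrupted number above:

```python
# Assumption: the null is forcing a float64 conversion somewhere in the
# write path. Round-tripping the big int64 through float64 reproduces
# the exact corrupted value seen in the output.
v = 7753285016841556620
print(int(float(v)))       # 7753285016841556992
print(int(float(v)) - v)   # 372 -- the rounding error at this magnitude
```

If that guess is right, it would also explain why small values like 10000 are unaffected: float64 represents every integer up to 2^53 exactly, and only larger values get silently rounded.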