jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658714201


   When enabling dictionary encoding for string partition fields, there are actually a bunch of failing tests.
   
   E.g. this one (based on `test_read_partitioned_directory`):
   
   ```python
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.dataset as ds
   
   foo_keys = [0, 1]
   bar_keys = ['a', 'b', 'c']
   partition_spec = [
       ['foo', foo_keys],
       ['bar', bar_keys]
   ]
   N = 30
   
   df = pd.DataFrame({
       'index': np.arange(N),
       'foo': np.array(foo_keys, dtype='i4').repeat(15),
       'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
       'values': np.random.randn(N)
   }, columns=['index', 'foo', 'bar', 'values'])
   
   from pyarrow.tests.test_parquet import _generate_partition_directories
   fs = pa.filesystem.LocalFileSystem()
    _generate_partition_directories(fs, "test_partition_directories",
                                    partition_spec, df)
   
   # works
   ds.dataset("test_partition_directories/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories/", partitioning=part)
   ```
   
   fails with 
   
   ```
    ArrowInvalid: Dictionary supplied for field bar: dictionary<values=string, indices=int32, ordered=0> does not contain 'c'
    In ../src/arrow/dataset/partition.cc, line 55, code: (_error_or_value13).status()
    In ../src/arrow/dataset/discovery.cc, line 243, code: (_error_or_value16).status()
   ```
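   
   The error message itself hints at the constraint being violated: a dictionary-typed value can only represent keys that are present in its dictionary, and here the inferred dictionary for `bar` apparently misses `'c'`. A minimal pyarrow sketch of that constraint (just an illustration, not the actual discovery code path, which I haven't traced yet):
   
   ```python
   import pyarrow as pa
   
   # Encoding the full set of partition keys yields a three-value dictionary ...
   full = pa.array(['a', 'b', 'c']).dictionary_encode()
   print(full.dictionary.to_pylist())  # ['a', 'b', 'c']
   
   # ... so a dictionary built from only part of the keys (hypothetically,
   # 'a' and 'b') has no index for 'c', matching the "does not contain 'c'"
   # message above.
   partial = pa.array(['a', 'b'])
   print(partial.index('c').as_py())  # -1: 'c' has no index in this dictionary
   ```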
   
   Another reproducible example (based on 
`test_write_to_dataset_with_partitions`) giving a similar error:
   
   ```python
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.dataset as ds
   
    output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                              'group2': list('eefeffgeee'),
                              'num': list(range(10)),
                              'nan': [np.nan] * 10,
                              'date': np.arange('2017-01-01', '2017-01-11',
                                                dtype='datetime64[D]')})
   cols = output_df.columns.tolist()
   partition_by = ['group1', 'group2']
   output_table = pa.Table.from_pandas(output_df, safe=False,
                                       preserve_index=False)
    filesystem = pa.filesystem.LocalFileSystem()
   base_path = "test_partition_directories2/"
   pq.write_to_dataset(output_table, base_path, partition_by,
                       filesystem=filesystem)
   
   # works
   ds.dataset("test_partition_directories2/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories2/", partitioning=part)
   ```
   
   I couldn't yet figure out why it is failing in those cases, though.
   
   
   I should have tested the dictionary encoding feature more thoroughly earlier, sorry about that.
   But given the current state (unless someone can fix it today, and I don't have much time), the choice seems quite simple: merge as is without dictionary encoding, or delay this feature until after 1.0.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

