Hello Group

I am having issues setting the stripe size, index stride, and index on an ORC
file using PySpark. I am getting approximately 2000 stripes for a 1.2 GB file,
when I expect only about 5 stripes with the 256 MB stripe-size setting.
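For reference, the expected stripe count follows from simple division (a sketch that assumes each stripe fills to the configured size, ignoring compression effects):

```python
import math

# Expected number of ORC stripes if every stripe fills to the configured size.
file_size_bytes = int(1.2 * 1024 ** 3)   # 1.2 GB file
stripe_size_bytes = 268435456            # 256 MB stripe setting

expected_stripes = math.ceil(file_size_bytes / stripe_size_bytes)
print(expected_stripes)  # 5
```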

I have tried the options below.

1. Set the .option() calls on the DataFrame writer. The compression setting
worked, but none of the other .option() settings did. Researching the
DataFrameWriter API shows a documented option only for compression, not for
the stripe size, index, or stride.

(df
    .repartition("custom field")
    .sortWithinPartitions("custom field", "sort field 1", "sort field 2")
    .write.format("orc")
    .option("compression", "zlib")           # only this option worked
    .option("preserveSortOrder", "true")
    .option("orc.stripe.size", "268435456")
    .option("orc.row.index.stride", "true")
    .option("orc.create.index", "true")
    .save("s3 location"))
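In case it helps, here is a minimal sketch that collects the ORC writer properties into one dict (the helper name is my own, not a Spark API; note that orc.row.index.stride is documented as a row count, not a true/false flag):

```python
# Hypothetical helper (name is mine): gathers ORC writer properties so they
# can be passed in one call via DataFrameWriter.options(**props).
def orc_writer_options(stripe_size_bytes=268435456, index_stride_rows=10000):
    return {
        "compression": "zlib",
        # stripe size in bytes; 268435456 = 256 MB
        "orc.stripe.size": str(stripe_size_bytes),
        # row index stride is a row count, not a boolean
        "orc.row.index.stride": str(index_stride_rows),
        "orc.create.index": "true",
    }

# usage (sketch, on a real DataFrame):
# df.write.format("orc").options(**orc_writer_options()).save(path)
```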


2. Created an empty Hive table with the above ORC settings and loaded data
into it using Spark's saveAsTable and insertInto methods. The resulting table
had more stripes than anticipated.

(df
    .repartition("custom field")
    .sortWithinPartitions("custom field", "sort field 1", "sort field 2")
    .write.format("orc")
    .mode("append")
    .saveAsTable("hive tablename"))   # also tried .insertInto("hive table name")
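For the Hive-table route, one variant (a sketch only; the database, table, and column names are placeholders of mine) is to declare the ORC properties in TBLPROPERTIES when creating the table, so subsequent inserts inherit them:

```python
# Sketch: build a CREATE TABLE DDL carrying the ORC table properties.
# On a real session you would run spark.sql(ddl) and then insert into it.
ddl = """
CREATE TABLE IF NOT EXISTS my_db.my_orc_table (id BIGINT, value STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',
  'orc.stripe.size' = '268435456',
  'orc.row.index.stride' = '10000',
  'orc.create.index' = 'true'
)
"""
# spark.sql(ddl)
# df.write.mode("append").insertInto("my_db.my_orc_table")
```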


For both options I have enabled the configs below.

spark.sql("set spark.sql.orc.impl=native")
spark.sql("set spark.sql.orc.enabled=true")
spark.sql("set spark.sql.orc.cache.stripe.details.size=268435456")

Please let me know if there is any missing piece of code, DataFrame writer
method, or Spark session-level config that would give us the desired results.

