[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-25438:
----------------------------------
Description:

This issue aims to fix three things in `FilterPushdownBenchmark`.

1. Use the same memory assumption.

The following configurations are used in ORC and Parquet.

*Memory buffer for writing*
- parquet.block.size (default: 128MB)
- orc.stripe.size (default: 64MB)

*Compression chunk size*
- parquet.page.size (default: 1MB)
- orc.compress.size (default: 256KB)

SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`, but it did not match `orc.compress.size` as well. As a result, the current benchmark compares ORC using 256KB of memory for compression against Parquet using 1MB. To compare correctly, the two settings need to be consistent.

2. Dictionary encoding should not be enforced for all cases.

SPARK-24206 enforced dictionary encoding for all test cases. This issue restores the default ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`.

3. Generate the test result on AWS r3.xlarge.

SPARK-24206 generated the result on AWS so that it is easy to reproduce and compare. This issue also updates the result on the same machine for the same reason. Specifically, an AWS r3.xlarge instance with Instance Store is used.
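The alignment described in point 1 can be sketched as a single option map. This is a minimal illustration, not the benchmark's actual code: the helper name `benchmarkWriteOptions` is hypothetical, while the four keys are the real ORC/Parquet configuration properties listed above.

```scala
// Hypothetical helper: one option map so ORC and Parquet share the same
// memory assumptions. 1MB (the parquet.page.size default) follows the
// choice made in SPARK-24692; this issue extends it to orc.compress.size.
def benchmarkWriteOptions(sizeInBytes: Long = 1048576L): Map[String, String] = Map(
  "parquet.block.size" -> sizeInBytes.toString, // memory buffer for writing
  "orc.stripe.size"    -> sizeInBytes.toString, // ORC counterpart
  "parquet.page.size"  -> sizeInBytes.toString, // compression chunk size
  "orc.compress.size"  -> sizeInBytes.toString  // must match, per this issue
)
```

A map like this could then be applied uniformly to both writers, e.g. `df.write.options(benchmarkWriteOptions()).orc(path)` and `df.write.options(benchmarkWriteOptions()).parquet(path)`, so neither format gets a larger buffer than the other.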
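Point 2 can be sketched the same way, assuming ORC's `orc.dictionary.key.threshold` property (raising it to 1.0 effectively forces dictionary encoding). The helper below is hypothetical; the idea is that only the dictionary table (e.g. the one built by `prepareStringDictTable`) passes an override, and every other table keeps ORC's default heuristic.

```scala
// Hypothetical helper: force ORC dictionary encoding only when requested.
// With no override, ORC falls back to its default threshold and decides
// per column whether a dictionary is worthwhile.
def orcDictionaryOptions(forceDictionary: Boolean): Map[String, String] =
  if (forceDictionary)
    Map("orc.dictionary.key.threshold" -> "1.0") // always build a dictionary
  else
    Map.empty // keep ORC's default behavior
```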
> Fix FilterPushdownBenchmark to use the same memory assumption
> -------------------------------------------------------------
>
> Key: SPARK-25438
> URL: https://issues.apache.org/jira/browse/SPARK-25438
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org