[
https://issues.apache.org/jira/browse/PARQUET-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320168#comment-17320168
]
ASF GitHub Bot commented on PARQUET-1982:
-----------------------------------------
fschmalzel commented on pull request #871:
URL: https://github.com/apache/parquet-mr/pull/871#issuecomment-818724064
Tested random access with a 351 MB Parquet file with zstd compression and
dictionary encoding enabled. The file has 321 row groups.
Reading the row groups in randomized order with the sequential method took
around 87319 milliseconds with this commit, and around 86311 milliseconds
without it.
Using the random access method took around 58941 milliseconds.
Note: This only improves randomized access; for sequential access both
methods should take about the same time. The sequential method is not
removed, so existing sequential readers are unaffected.
These numbers should be taken with a grain of salt, as they were measured
without a proper benchmark harness such as JMH:
https://github.com/openjdk/jmh#java-microbenchmark-harness-jmh
Unfortunately I cannot upload the tested file anywhere. If you want me to
test a different file, I will gladly do so.
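Until the measurement is redone with JMH, one cheap way to reduce JIT and cache noise in ad-hoc timings is a few warm-up runs followed by the median of several timed runs. A minimal sketch (the `TimingSketch` class, `medianMillis` helper, and placeholder workload are illustrative only, not part of this PR):

```java
import java.util.Arrays;

public class TimingSketch {

    // Run the workload a few times to warm up the JIT, then report the
    // median of several timed runs instead of a single measurement.
    static long medianMillis(Runnable workload, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            workload.run();
        }
        long[] samples = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long startNanos = System.nanoTime();
            workload.run();
            samples[i] = (System.nanoTime() - startNanos) / 1_000_000;
        }
        Arrays.sort(samples);
        return samples[timedRuns / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for reading the shuffled row groups.
        long median = medianMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i;
            }
        }, 3, 5);
        System.out.println("median is non-negative: " + (median >= 0));
    }
}
```

This is still far cruder than JMH (no dead-code elimination guards, no forked JVMs), but a median over warmed-up runs is usually more stable than one wall-clock sample.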
Quick and dirty test:
```java
package org.apache.parquet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.junit.Test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ReadTest {

  private static final long RANDOM_SEED = 7174252115631550700L;
  private static final String FILE_PATH = "C:\\Users\\fes\\Desktop\\test.parquet";
  private static final int ROUNDS = 4;

  @Test
  public void testSequential() throws IOException {
    Random random = new Random(RANDOM_SEED);
    Path file = new Path(FILE_PATH);
    Configuration configuration = new Configuration();
    ParquetReadOptions options = ParquetReadOptions.builder().build();
    ParquetFileReader reader =
        new ParquetFileReader(HadoopInputFile.fromPath(file, configuration), options);
    MessageType schema = reader.getFileMetaData().getSchema();
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    GroupRecordConverter converter = new GroupRecordConverter(schema);

    // Visit every row group ROUNDS times, in shuffled order.
    int blockAmount = reader.getRowGroups().size();
    List<Integer> indexes = new ArrayList<>();
    for (int j = 0; j < ROUNDS; j++) {
      for (int i = 0; i < blockAmount; i++) {
        indexes.add(i);
      }
    }
    Collections.shuffle(indexes, random);

    long start = System.currentTimeMillis();
    int currentRowGroup = -1;
    for (int index : indexes) {
      // The sequential API only moves forward, so going back to an earlier
      // row group requires reopening the file and skipping from the start.
      if (index <= currentRowGroup) {
        try {
          reader.close();
        } catch (Exception ignored) {
        }
        reader = new ParquetFileReader(HadoopInputFile.fromPath(file, configuration), options);
        currentRowGroup = -1;
      }
      for (int i = 1; i < index - currentRowGroup; i++) {
        reader.skipNextRowGroup();
      }
      currentRowGroup = index;
      PageReadStore pages = reader.readNextRowGroup();
      long rowCount = pages.getRowCount();
      RecordReader<Group> recordReader = columnIO.getRecordReader(pages, converter);
      for (long i = 0; i < rowCount; i++) {
        recordReader.read();
      }
    }
    long stop = System.currentTimeMillis();

    try {
      reader.close();
    } catch (Exception ignored) {
    }
    long timeTaken = stop - start;
    System.out.printf("Sequential access took %d milliseconds%n", timeTaken);
  }

  @Test
  public void testRandom() throws IOException {
    Random random = new Random(RANDOM_SEED);
    Path file = new Path(FILE_PATH);
    Configuration configuration = new Configuration();
    ParquetReadOptions options = ParquetReadOptions.builder().build();
    ParquetFileReader reader =
        new ParquetFileReader(HadoopInputFile.fromPath(file, configuration), options);
    MessageType schema = reader.getFileMetaData().getSchema();
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    GroupRecordConverter converter = new GroupRecordConverter(schema);

    // Visit every row group ROUNDS times, in shuffled order.
    int blockAmount = reader.getRowGroups().size();
    List<Integer> indexes = new ArrayList<>();
    for (int j = 0; j < ROUNDS; j++) {
      for (int i = 0; i < blockAmount; i++) {
        indexes.add(i);
      }
    }
    Collections.shuffle(indexes, random);

    long start = System.currentTimeMillis();
    for (int index : indexes) {
      // With this patch, any row group can be read directly by index.
      PageReadStore pages = reader.readRowGroup(index);
      long rowCount = pages.getRowCount();
      RecordReader<Group> recordReader = columnIO.getRecordReader(pages, converter);
      for (long i = 0; i < rowCount; i++) {
        recordReader.read();
      }
    }
    long stop = System.currentTimeMillis();

    try {
      reader.close();
    } catch (Exception ignored) {
    }
    long timeTaken = stop - start;
    System.out.printf("Random access took %d milliseconds%n", timeTaken);
  }
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Allow random access to row groups in ParquetFileReader
> ------------------------------------------------------
>
> Key: PARQUET-1982
> URL: https://issues.apache.org/jira/browse/PARQUET-1982
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Felix Schmalzel
> Priority: Minor
> Labels: parquetReader, random-access
>
> The SeekableInputStream and all other components of the
> ParquetFileReader already support random access, and row groups should be
> independent of each other.
> This would allow reusing the opened InputStream when going back to an
> earlier row group. It also makes accessing specific row groups a lot easier.
> I've already developed a patch that enables this functionality. I will
> link the merge request in the next few days.
> Is there a related ticket that I have overlooked?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)