Hello,

I'm trying to read a Parquet file from disk into Arrow in memory, in Scala, and I'm wondering what the most efficient approach is, especially for the reading part. I'm aware that Parquet reading is perhaps beyond the scope of this mailing list, but:
- I believe Arrow and Parquet are closely intertwined these days?
- I can't find an appropriate Parquet mailing list.

Any pointers would be appreciated!

Below is the code I currently have. My concern is that this alone already takes about 2 s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes roughly 100 ms in Python, so I suspect I'm not doing this in the most efficient way possible. The Parquet data holds 1,570,150 rows with 14 columns of various types, and takes 15 MB on disk.

import java.nio.file.{Path, Paths}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.ColumnIOFactory

...

val path: Path = Paths.get("C:\\item.pq")
val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
val conf = new Configuration()

// Read the footer to get the file schema.
val readFooter = ParquetFileReader.readFooter(conf, jpath, ParquetMetadataConverter.NO_FILTER)
val schema = readFooter.getFileMetaData.getSchema

// Open the file and load the first row group.
val r = ParquetFileReader.open(conf, jpath)
val pages = r.readNextRowGroup()
val rows = pages.getRowCount

// Assemble records row by row via the example Group API.
val columnIO = new ColumnIOFactory().getColumnIO(schema)
val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))

// This takes about 2 s.
(1 to rows.toInt).foreach { i =>
  val group = recordReader.read
  // Just read the first column for now ...
  val x = group.getLong(0, 0)
}

...

As this will be in the hot path of my code, I'm quite keen to make it as fast as possible. Note that the eventual objective is to build Arrow data, so I was assuming there would be a way to quickly load the columns. I suspect the loop over the rows, building row-based records, is causing a lot of the overhead, but I can't seem to find another way (my current best guesses are sketched in the P.S. below).

Thanks,
-J
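P.S. The closest thing I've found to a column-wise read path is ColumnReadStoreImpl from parquet-column, which walks a single column's values without assembling Group records. Below is a sketch of what I mean (untested, and the four-argument constructor may differ between parquet-mr versions, so treat the details as guesses rather than a known-good recipe):

import org.apache.parquet.column.impl.ColumnReadStoreImpl

// Reuses "pages", "schema" and "readFooter" from the snippet above.
val columnStore = new ColumnReadStoreImpl(
  pages,
  new GroupRecordConverter(schema).getRootConverter,
  schema,
  readFooter.getFileMetaData.getCreatedBy)

// Read the first column value by value, skipping record assembly entirely.
val descriptor = schema.getColumns.get(0)
val reader = columnStore.getColumnReader(descriptor)
val maxDefinitionLevel = descriptor.getMaxDefinitionLevel
val total = reader.getTotalValueCount

var i = 0L
while (i < total) {
  // Only slots at the max definition level hold an actual (non-null) value.
  if (reader.getCurrentDefinitionLevel == maxDefinitionLevel) {
    val x = reader.getLong
  }
  reader.consume()
  i += 1
}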
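P.P.S. Since the eventual objective is Arrow data, I imagine the loop above would be replaced by one that fills an Arrow vector directly. Again just a sketch, assuming a non-repeated 64-bit integer column and the arrow-vector library on the classpath; the vector name "col0" is made up:

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector

// "reader", "maxDefinitionLevel" and "rows" come from the sketches above.
val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("col0", allocator)
vector.allocateNew(rows.toInt)

var row = 0
while (row < rows.toInt) {
  if (reader.getCurrentDefinitionLevel == maxDefinitionLevel)
    vector.setSafe(row, reader.getLong) // copy the defined value
  else
    vector.setNull(row)                 // mark the slot as null
  reader.consume()
  row += 1
}
vector.setValueCount(rows.toInt)
// (The vector and allocator should eventually be close()d.)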