Hello,

I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
I'm wondering what the most efficient approach is, especially for the
reading part. I'm aware that Parquet reading is perhaps beyond the scope of
this mailing list, but:

- I believe Arrow and Parquet are closely intertwined these days?
- I can't find an appropriate Parquet mailing list.

Any pointers would be appreciated!

Below is the code I currently have. My concern is that this alone already
takes about 2 s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
~100 ms in Python, so I suspect I'm not doing this in the most efficient
way possible ... The Parquet data holds 1,570,150 rows, with 14 columns of
various types, and takes 15 MB on disk.

import java.nio.file.{Path, Paths}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.ColumnIOFactory

...

val path: Path = Paths.get("C:\\item.pq")
val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
val conf = new Configuration()

// Read the footer for the schema, then open the file.
val readFooter = ParquetFileReader.readFooter(conf, jpath,
  ParquetMetadataConverter.NO_FILTER)
val schema = readFooter.getFileMetaData.getSchema
val r = ParquetFileReader.open(conf, jpath)

// Read the next row group and set up a record reader over it.
val pages = r.readNextRowGroup()
val rows = pages.getRowCount

val columnIO = new ColumnIOFactory().getColumnIO(schema)
val recordReader =
  columnIO.getRecordReader(pages, new GroupRecordConverter(schema))

// This takes about 2s
(1 to rows.toInt).foreach { _ =>
  val group = recordReader.read
  // Just read the first column for now ...
  group.getLong(0, 0)
}

...

As this will be in the hot path of my code, I'm quite keen to make it
as fast as possible. Note that the eventual objective is to build Arrow
data, so I was assuming there would be a way to quickly load the
columns. I suspect the loop over the rows, building row-based records,
is causing a lot of the overhead, but I can't find a supported way to
avoid it.
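
The closest I have come is the rough, untested sketch below, which skips
the Group record assembly and pulls a single column through the
ColumnReader API, copying the values straight into an Arrow vector. It
assumes the first column is a required INT64, that the Arrow Java
vector/memory modules are on the classpath, and it reuses the schema,
pages, readFooter and rows values from the code above:

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector
import org.apache.parquet.column.impl.ColumnReadStoreImpl

// Read the first column directly, without assembling row-based Group records.
val colDesc = schema.getColumns.get(0) // assumption: column 0 is a required INT64
val colReadStore = new ColumnReadStoreImpl(
  pages,
  new GroupRecordConverter(schema).getRootConverter,
  schema,
  readFooter.getFileMetaData.getCreatedBy)
val colReader = colReadStore.getColumnReader(colDesc)

// Copy the values into an Arrow vector ("col0" is just an arbitrary field name).
val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("col0", allocator)
vector.allocateNew(rows.toInt)

var i = 0
while (i < rows.toInt) {
  // For a flat, required column there is exactly one value per row.
  vector.set(i, colReader.getLong())
  colReader.consume()
  i += 1
}
vector.setValueCount(rows.toInt)

Is something along these lines the intended route, or is there a
better-supported way to get from Parquet pages to Arrow vectors on the
JVM?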


Thanks,

-J
