Hi Pratik,

You are correct that the overhead makes reading small files less practical, but I wouldn't necessarily say the overhead is any worse than usual. Parquet is optimized for reading a lot of data. If the data had been compressed, the Parquet reader would have been running a block decompression algorithm over very small runs of bytes in your example case, which is a very inefficient use of such an algorithm. In any case where your application produces datasets this small and wants to persist them to disk, you should not be using Parquet. Files like these should be aggregated by some kind of ETL process into one large Parquet file for long-term storage and fast analytical processing.
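To make the aggregation suggestion concrete, here is a rough sketch of what that step could look like using pyarrow. It is only an illustration, not anything from this thread: the paths, the row-group size, and the choice of pyarrow are placeholders, and any ETL tool that can rewrite many small files into one large Parquet file would do the same job.

import glob

import pyarrow as pa
import pyarrow.parquet as pq

# Collect the small Parquet files produced by the application.
small_files = sorted(glob.glob("incoming/*.parquet"))

# Read each one into memory and concatenate; this assumes the schemas match.
combined = pa.concat_tables([pq.read_table(path) for path in small_files])

# Write a single file with reasonably large row groups so that compression
# blocks cover more than a handful of values and scans stay sequential.
pq.write_table(combined, "warehouse/aggregated.parquet",
               row_group_size=128 * 1024)

Readers then pay the footer/metadata overhead once per large file instead of once per tiny file.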
-Jasin

On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]> wrote:
> Hello,
>
> Does anyone know if the Parquet format is generally not well suited or slow
> for reading and writing VARCHAR fields? I am currently investigating why it
> takes longer to read a parquet file which has 5 cols BIGINT(20),
> BIGINT(20), SMALLINT(6), SMALLINT(6), VARCHAR(255) than reading a simple
> csv file.
>
> For reading ALL the columns, it takes about 2ms to read a csv file vs 650ms
> for a Parquet file with the same data. There are only 700 rows in the
> table.
>
> Does anyone have any information about it?
> I suspect the overhead of the parquet format is more for smaller files.
>
> Thanks,
> Pratik
>
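For anyone who wants to reproduce the kind of comparison described above, a rough timing sketch along these lines can show how much of the Parquet number is fixed per-file setup cost rather than actual column decoding. The file names are placeholders and the pyarrow/csv calls are just one possible way to run it, not the setup from the original benchmark.

import csv
import time

import pyarrow.parquet as pq

def best_time(fn, repeats=5):
    # Take the best wall-clock time over a few runs to smooth out noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def read_csv():
    # Read the whole CSV file into a list of rows.
    with open("small_table.csv", newline="") as f:
        return list(csv.reader(f))

def read_parquet():
    # Read all columns of the equivalent Parquet file.
    return pq.read_table("small_table.parquet")

print("csv     :", best_time(read_csv))
print("parquet :", best_time(read_parquet))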
