Hi Pratik,

You are correct that the overhead makes reading small files less practical, but I wouldn't necessarily say the overhead is any worse than usual. Parquet is optimized for reading a lot of data. If the data had been compressed, the Parquet reader would have been running a block decompression algorithm over very small runs of bytes in your example case, which is a very inefficient use of such an algorithm. In any case where your application produces datasets this small and wants to persist them to disk, you should not be using Parquet. Files like these should be aggregated by some kind of ETL process into one large Parquet file for long-term storage and fast analytical processing.
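To make the aggregation suggestion concrete, here is a rough sketch of what that step could look like using pyarrow. It is only an illustration, not anything from this thread: the paths, the row-group size, and the choice of pyarrow are placeholders, and any ETL tool that can rewrite many small files into one large Parquet file would do the same job.

import glob

import pyarrow as pa
import pyarrow.parquet as pq

# Collect the small Parquet files produced by the application.
small_files = sorted(glob.glob("incoming/*.parquet"))

# Read each one into memory and concatenate; this assumes the schemas match.
combined = pa.concat_tables([pq.read_table(path) for path in small_files])

# Write a single file with reasonably large row groups so that compression
# blocks cover more than a handful of values and scans stay sequential.
pq.write_table(combined, "warehouse/aggregated.parquet",
               row_group_size=128 * 1024)

Readers then pay the footer/metadata overhead once per large file instead of once per tiny file.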
-Jasin

On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]> wrote:
> Hello,
>
> Does anyone know if the Parquet format is generally not well suited or slow
> for reading and writing VARCHAR fields? I am currently investigating why it
> takes longer to read a parquet file which has 5 cols BIGINT(20),
> BIGINT(20), SMALLINT(6), SMALLINT(6), VARCHAR(255) than reading a simple
> csv file.
>
> For reading ALL the columns, it takes about 2ms to read a csv file vs 650ms
> for a Parquet file with the same data. There are only 700 rows in the
> table.
>
> Does anyone have any information about it?
> I suspect the overhead of the parquet format is more for smaller files.
>
> Thanks,
> Pratik
>
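For anyone who wants to reproduce the kind of comparison described above, a rough timing sketch along these lines can show how much of the Parquet number is fixed per-file setup cost rather than actual column decoding. The file names are placeholders and the pyarrow/csv calls are just one possible way to run it, not the setup from the original benchmark.

import csv
import time

import pyarrow.parquet as pq

def best_time(fn, repeats=5):
    # Take the best wall-clock time over a few runs to smooth out noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def read_csv():
    # Read the whole CSV file into a list of rows.
    with open("small_table.csv", newline="") as f:
        return list(csv.reader(f))

def read_parquet():
    # Read all columns of the equivalent Parquet file.
    return pq.read_table("small_table.parquet")

print("csv     :", best_time(read_csv))
print("parquet :", best_time(read_parquet))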
