Disregard my last message; Gmail wasn't showing me the new messages, so I hadn't seen Julien's response.
On Thu, Sep 18, 2014 at 10:15 AM, Jason Altekruse <[email protected]> wrote:

> Hi Pratik,
>
> You are correct that the overhead makes reading small files less feasible,
> but I wouldn't necessarily say that the overhead is any worse than usual.
> Parquet is optimized for reading a lot of data. If the data was compressed,
> the Parquet reader would have been running a block decompression algorithm
> on very small lengths of bytes in your example case, a very non-optimal use
> of such an algorithm. In any case where your application is producing such
> small datasets that it wants to persist to disk, you should not be using
> Parquet; these types of files should be aggregated by some kind of ETL
> process into one large Parquet file for long-term storage and fast
> analytical processing.
>
> -Jason
>
> On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]> wrote:
>
>> Hello,
>>
>> Does anyone know if the Parquet format is generally not well suited or
>> slow for reading and writing VARCHAR fields? I am currently investigating
>> why it takes longer to read a Parquet file which has 5 cols BIGINT(20),
>> BIGINT(20), SMALLINT(6), SMALLINT(6), VARCHAR(255) than reading a simple
>> csv file.
>>
>> For reading ALL the columns, it takes about 2ms to read a csv file vs
>> 650ms for a Parquet file with the same data. There are only 700 rows in
>> the table.
>>
>> Does anyone have any information about it?
>> I suspect the overhead of the Parquet format is greater for smaller files.
>>
>> Thanks,
>> Pratik
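For anyone landing on this thread later, here is a minimal sketch of the kind of aggregation Jason describes, assuming PyArrow is available and using hypothetical file paths: many small CSV outputs are appended into a single Parquet file so the per-file metadata and decompression overhead is paid once rather than per tiny file.

```python
# Sketch only: combine many small CSV outputs into one Parquet file.
# Paths ("staging/part-*.csv", "warehouse/aggregated.parquet") are
# hypothetical placeholders for whatever the upstream process produces.
import glob

import pyarrow.csv as pv_csv
import pyarrow.parquet as pq

small_files = sorted(glob.glob("staging/part-*.csv"))

writer = None
for path in small_files:
    # Each small file has the same schema, e.g. 2x BIGINT, 2x SMALLINT, 1x VARCHAR.
    table = pv_csv.read_csv(path)
    if writer is None:
        # Open the output once, using the schema of the first file.
        writer = pq.ParquetWriter("warehouse/aggregated.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```

Queries then scan the single aggregated file, which is the access pattern Parquet's columnar layout is built for.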
