Disregard my last message; Gmail wasn't showing me the new messages, so I hadn't seen Julien's response.
On Thu, Sep 18, 2014 at 10:15 AM, Jason Altekruse <[email protected]> wrote:

> Hi Pratik,
>
> You are correct that the overhead makes reading small files less feasible,
> but I wouldn't necessarily say that the overhead is any worse than usual.
> Parquet is optimized for reading a lot of data. If the data was compressed,
> the Parquet reader would have been running a block decompression algorithm
> on very small lengths of bytes in your example case, a very non-optimal use
> of such an algorithm. In any case where your application is producing such
> small datasets that it wants to persist to disk, you should not be using
> Parquet; these types of files should be aggregated by some kind of ETL
> process into one large Parquet file for long-term storage and fast
> analytical processing.
>
> -Jason
>
> On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]> wrote:
>
>> Hello,
>>
>> Does anyone know if the Parquet format is generally not well suited or
>> slow for reading and writing VARCHAR fields? I am currently investigating
>> why it takes longer to read a Parquet file which has 5 cols BIGINT(20),
>> BIGINT(20), SMALLINT(6), SMALLINT(6), VARCHAR(255) than reading a simple
>> csv file.
>>
>> For reading ALL the columns, it takes about 2ms to read a csv file vs
>> 650ms for a Parquet file with the same data. There are only 700 rows in
>> the table.
>>
>> Does anyone have any information about it?
>> I suspect the overhead of the Parquet format is greater for smaller files.
>>
>> Thanks,
>> Pratik
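For anyone landing on this thread later, here is a minimal sketch of the kind of aggregation Jason describes, assuming PyArrow is available and using hypothetical file paths: many small CSV outputs are appended into a single Parquet file so the per-file metadata and decompression overhead is paid once rather than per tiny file.

```python
# Sketch only: combine many small CSV outputs into one Parquet file.
# Paths ("staging/part-*.csv", "warehouse/aggregated.parquet") are
# hypothetical placeholders for whatever the upstream process produces.
import glob

import pyarrow.csv as pv_csv
import pyarrow.parquet as pq

small_files = sorted(glob.glob("staging/part-*.csv"))

writer = None
for path in small_files:
    # Each small file has the same schema, e.g. 2x BIGINT, 2x SMALLINT, 1x VARCHAR.
    table = pv_csv.read_csv(path)
    if writer is None:
        # Open the output once, using the schema of the first file.
        writer = pq.ParquetWriter("warehouse/aggregated.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```

Queries then scan the single aggregated file, which is the access pattern Parquet's columnar layout is built for.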
