Thanks for your response, Jason. That helps too.

Regards,
~Pratik
On Thu, Sep 18, 2014 at 10:19 AM, Jason Altekruse <[email protected]> wrote:

> Disregard my last message, gmail wasn't showing me the new messages, so I
> hadn't seen Julien's response.
>
> On Thu, Sep 18, 2014 at 10:15 AM, Jason Altekruse <[email protected]> wrote:
>
> > Hi Pratik,
> >
> > You are correct that the overhead makes reading small files less
> > feasible, but I wouldn't necessarily say that the overhead is any worse
> > than usual. Parquet is optimized for reading a lot of data. If the data
> > were compressed, the Parquet reader would have been running a block
> > decompression algorithm on very small runs of bytes in your example
> > case, a very non-optimal use of such an algorithm. In any case where
> > your application is producing datasets this small that it wants to
> > persist to disk, you should not be using Parquet directly; these files
> > should be aggregated by some kind of ETL process into one large Parquet
> > file for long-term storage and fast analytical processing.
> >
> > -Jason
> >
> > On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> Does anyone know whether the Parquet format is generally ill-suited or
> >> slow for reading and writing VARCHAR fields? I am currently
> >> investigating why it takes longer to read a Parquet file with 5 columns
> >> (BIGINT(20), BIGINT(20), SMALLINT(6), SMALLINT(6), VARCHAR(255)) than a
> >> simple CSV file.
> >>
> >> For reading ALL the columns, it takes about 2 ms to read the CSV file
> >> vs. 650 ms for the Parquet file with the same data. There are only 700
> >> rows in the table.
> >>
> >> Does anyone have any information about this?
> >> I suspect the overhead of the Parquet format is larger for smaller files.
> >>
> >> Thanks,
> >> Pratik
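
[Editor's note: the thread itself contains no code. Below is a minimal sketch of the "aggregate many small files into one large Parquet file" pattern Jason describes, using the pyarrow library. The staging directory, output path, and column names are hypothetical, and the small input files are assumed to share one schema; this is an illustration of the pattern, not the poster's pipeline.]

    # Sketch: combine many small CSV drops into a single large Parquet file,
    # so analytical reads pay the Parquet per-file/per-block overhead once.
    import glob

    import pyarrow as pa
    import pyarrow.csv as pa_csv
    import pyarrow.parquet as pq

    # Hypothetical staging directory holding the many small files produced
    # by the upstream application (assumed to share the same schema).
    small_files = sorted(glob.glob("staging/*.csv"))

    # Read each small file into an Arrow table and concatenate them.
    tables = [pa_csv.read_csv(path) for path in small_files]
    combined = pa.concat_tables(tables)

    # Write one large Parquet file. Compression and a large row group are
    # effective here precisely because the data is no longer tiny.
    pq.write_table(
        combined,
        "warehouse/combined.parquet",  # hypothetical output path
        compression="snappy",
        row_group_size=1_000_000,
    )

    # Later reads can then prune to just the columns they need
    # (column names here are hypothetical).
    subset = pq.read_table("warehouse/combined.parquet", columns=["id", "name"])

Run as a periodic ETL step, this keeps the fast-arriving small datasets out of Parquet until there is enough data for the columnar layout to pay off.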
