Re: The largest table that Parquet can support

2016-01-07 Thread Yan Qi
Sure, it is possible to change the row group size and the other settings. Right now
we are setting the Parquet block size (row group size) to 256 MB, the page size to
1 MB, and giving the JVM about 3 GB via -Xmx.
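
(For reference, this is roughly how we pass those settings today. The property names
parquet.block.size and parquet.page.size are the keys parquet-mr reads; the heap
option assumes an MR2-style job and the class is purely illustrative, so treat this
as a sketch rather than our exact job code.)

import org.apache.hadoop.conf.Configuration;

// Illustrative helper only, not our actual job setup.
public class CurrentWriterSettings {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setInt("parquet.block.size", 256 * 1024 * 1024);  // row group size, 256 MB
    conf.setInt("parquet.page.size", 1024 * 1024);         // page size, 1 MB
    conf.set("mapreduce.map.java.opts", "-Xmx3g");         // writer task heap, ~3 GB
    return conf;
  }
}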

My question is not directly about the sizes, though, since conceptually we can
always solve the problem by giving the JVM more memory. I am trying to figure out
the right WAY to define the schema, because we are limited to less than 5 GB of
JVM heap, and a Parquet block size that is too small compromises the benefits of
columnar storage. It is also possible that we will add more 'MARKET's in the
future, making the number of table columns even larger. So we need some concrete
idea of how much memory Parquet itself consumes (e.g., I suppose Parquet keeps an
internal structure for the table schema).
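
As a rough back-of-envelope (assuming the flat schema quoted below, and assuming the
writer keeps some buffered state per leaf column; the per-column figure here is only
a guess, not a measured number):

  leaf columns ~= 1 (id) + 100 markets x 5 items x 50 attributes = 25,001
  at ~64-128 KB of buffered pages per column ~= 1.6-3.2 GB per open file

which would already be close to the heap we have available, before a single row
group even fills up.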

Any suggestions?

Thanks,
Yan

On Thu, Jan 7, 2016 at 11:14 AM, Reuben Kuhnert  wrote:

> Hi again Yan,
>
> Sorry about the late reply. The *ParquetOutputFormat* class has a number of
> setters:
>
>   public static void setBlockSize(Job job, int blockSize) {
>     getConfiguration(job).setInt(BLOCK_SIZE, blockSize);
>   }
>
>   public static void setPageSize(Job job, int pageSize) {
>     getConfiguration(job).setInt(PAGE_SIZE, pageSize);
>   }
>
>   public static void setDictionaryPageSize(Job job, int pageSize) {
>     getConfiguration(job).setInt(DICTIONARY_PAGE_SIZE, pageSize);
>   }
>
>   public static void setCompression(Job job, CompressionCodecName compression) {
>     getConfiguration(job).set(COMPRESSION, compression.name());
>   }
>
>   public static void setEnableDictionary(Job job, boolean enableDictionary) {
>     getConfiguration(job).setBoolean(ENABLE_DICTIONARY, enableDictionary);
>   }
>
> These allow you to set the 'row group' (i.e. block) size and the page size,
> which determine how much data is written out per block (and, transitively,
> how much data is retained in memory before a flush). Try setting these to,
> say, 128 MB for a block and 1 MB for a page as a test. If that doesn't work,
> can you let us know what sizes you're currently using (via the associated
> getters, also on ParquetOutputFormat)?
>
> Thanks
>
> On Wed, Jan 6, 2016 at 4:23 PM, Yan Qi  wrote:
>
> > Hi Reuben,
> >
> > Thanks for your quick reply! :)
> >
> > The table has nested columns with the following Avro schema:
> >
> > {
> > "namespace": "profile.avro.parquet.model",
> > "type": "record",
> > "name": "Profile",
> > "fields": [
> > {"name": "id", "type": "int"},
> > {"name": "M1", "type": ["Market", "null"]},
> > {"name": "M2", "type": ["Market", "null"]},
> > ...
> > ...
> > {"name": "M100", "type": ["Market", "null"]}
> > ]
> > }
> >
> > {
> > "namespace": "profile.avro.parquet.model",
> > "type": "record",
> > "name": "Market",
> > "fields": [
> > {"name": "item1", "type": [{ "type": "array", "items": "Client"}, "null"]},
> > {"name": "item2", "type": [{ "type": "array", "items": "Client"}, "null"]},
> > {"name": "item3", "type": [{ "type": "array", "items": "Client"}, "null"]},
> > {"name": "item4", "type": [{ "type": "array", "items": "Client"}, "null"]},
> > {"name": "item5", "type": [{ "type": "array", "items": "Client"}, "null"]}
> > ]
> > }
> >
> > {
> > "namespace": "profile.avro.parquet.model",
> > "type": "record",
> > "name": "Client",
> > "fields": [
> > {"name": "attribute1", "type": "int"},
> > {"name": "attribute2", "type": "int"},
> > {"name": "attribute3", "type": "int"},
> > ..
> > ..
> > {"name": "attribute50", "type": "int"}
> > ]
> > }
> >
> > Each record in the table may not have every attribute populated. For
> > example, a Profile record may have values only for M1, M20 and M89, with
> > the others empty. When we tried to write such a record in the Parquet
> > format, it required a lot of memory just to get started.
> >
> > We also tried another way to define the table, like:
> >
> > {
> > "namespace": "profile.avro.parquet.model",
> > "type": "record",
> > "name": "Profile",
> > "fields": [
> > {"name": "id", "type": "int"},
> > {"name": "markets", "type": [{ "type": "array", "items": "Market"}, "null"]}
> > ]
> > }
> >
> > Interestingly, it can handle the same data with much less memory. But we
> > lose the columnar storage benefits for the Market members, because we have
> > to load data from all markets regardless of which market we actually care
> > about.
> >
> > I hope this gives you a rough idea of the application. So my question is
> > whether increasing the memory size is the only way forward in the former
> > case, or whether there is a better way to define the table.
> >
> > Best regards,
> >
> > Yan
> >
> >
> >
> > On Wed, Jan 6, 2016 at 12:03 PM, Reuben Kuhnert <
> > reuben.kuhn...@cloudera.com
> > > wrote:
> >
> > > Hi Yan,
> > >
> > > So the primary concern here would be the 'row group' size that you're
> > > using for your

Re: The largest table that Parquet can support

2016-01-07 Thread Reuben Kuhnert
Hi again Yan,

Sorry about the late reply. The *ParquetOutputFormat* class has a number of
setters:

  public static void setBlockSize(Job job, int blockSize) {
    getConfiguration(job).setInt(BLOCK_SIZE, blockSize);
  }

  public static void setPageSize(Job job, int pageSize) {
    getConfiguration(job).setInt(PAGE_SIZE, pageSize);
  }

  public static void setDictionaryPageSize(Job job, int pageSize) {
    getConfiguration(job).setInt(DICTIONARY_PAGE_SIZE, pageSize);
  }

  public static void setCompression(Job job, CompressionCodecName compression) {
    getConfiguration(job).set(COMPRESSION, compression.name());
  }

  public static void setEnableDictionary(Job job, boolean enableDictionary) {
    getConfiguration(job).setBoolean(ENABLE_DICTIONARY, enableDictionary);
  }

These allow you to set the 'row group' (i.e. block) size and the page size,
which determine how much data is written out per block (and, transitively, how
much data is retained in memory before a flush). Try setting these to, say,
128 MB for a block and 1 MB for a page as a test. If that doesn't work, can you
let us know what sizes you're currently using (via the associated getters, also
on ParquetOutputFormat)?
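
For concreteness, here is a minimal sketch of wiring these setters up on a Hadoop
Job (the package names assume a recent parquet-mr release; older versions use
parquet.hadoop.* instead of org.apache.parquet.hadoop.*, and the job name and
codec are just placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteConfig {
  public static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "profile-write");    // placeholder job name
    // Row group (block) size: how much data is buffered in memory per open file
    // before being flushed to disk as one row group.
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);      // 128 MB
    // Page size: the unit of encoding/compression within each column chunk.
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);             // 1 MB
    ParquetOutputFormat.setDictionaryPageSize(job, 1024 * 1024);   // 1 MB
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    ParquetOutputFormat.setEnableDictionary(job, true);
    return job;
  }
}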

Thanks

On Wed, Jan 6, 2016 at 4:23 PM, Yan Qi  wrote:

> Hi Reuben,
>
> Thanks for your quick reply! :)
>
> The table has nested columns with the following Avro schema:
>
> {
> "namespace": "profile.avro.parquet.model",
> "type": "record",
> "name": "Profile",
> "fields": [
> {"name": "id", "type": "int"},
> {"name": "M1", "type": ["Market", "null"]},
> {"name": "M2", "type": ["Market", "null"]},
> ...
> ...
> {"name": "M100", "type": ["Market", "null"]}
> ]
> }
>
> {
> "namespace": "profile.avro.parquet.model",
> "type": "record",
> "name": "Market",
> "fields": [
> {"name": "item1", "type": [{ "type": "array", "items": "Client"}, "null"]},
> {"name": "item2", "type": [{ "type": "array", "items": "Client"}, "null"]},
> {"name": "item3", "type": [{ "type": "array", "items": "Client"}, "null"]},
> {"name": "item4", "type": [{ "type": "array", "items": "Client"}, "null"]},
> {"name": "item5", "type": [{ "type": "array", "items": "Client"}, "null"]}
> ]
> }
>
> {
> "namespace": "profile.avro.parquet.model",
> "type": "record",
> "name": "Client",
> "fields": [
> {"name": "attribute1", "type": "int"},
> {"name": "attribute2", "type": "int"},
> {"name": "attribute3", "type": "int"},
> ..
> ..
> {"name": "attribute50", "type": "int"}
> ]
> }
>
> Each record in the table may not have every attribute populated. For example,
> a Profile record may have values only for M1, M20 and M89, with the others
> empty. When we tried to write such a record in the Parquet format, it required
> a lot of memory just to get started.
>
> We also tried another way to define the table, like:
>
> {
> "namespace": "profile.avro.parquet.model",
> "type": "record",
> "name": "Profile",
> "fields": [
> {"name": "id", "type": "int"},
> {"name": "markets", "type": [{ "type": "array", "items": "Market"}, "null"]}
> ]
> }
>
> Interestingly, it can handle the same data with much less memory. But we lose
> the columnar storage benefits for the Market members, because we have to load
> data from all markets regardless of which market we actually care about.
>
> I hope this gives you a rough idea of the application. So my question is
> whether increasing the memory size is the only way forward in the former case,
> or whether there is a better way to define the table.
>
> Best regards,
>
> Yan
>
>
>
> On Wed, Jan 6, 2016 at 12:03 PM, Reuben Kuhnert <
> reuben.kuhn...@cloudera.com
> > wrote:
>
> > Hi Yan,
> >
> > So the primary concern here would be the 'row group' size that you're using
> > for your table. The row group is basically what determines how much
> > information is stored in memory before being flushed to disk (and this
> > becomes an even bigger issue if you have multiple Parquet files open
> > simultaneously). Could you share some stats about your file with us? Let's
> > see if we can't get you moving again.
> >
> > Thanks
> > Reuben
> >
> > On Wed, Jan 6, 2016 at 1:54 PM, Yan Qi  wrote:
> >
> > > We are trying to create a large table in Parquet. The table has up to
> > > thousands of columns, but an individual record may not be large because
> > > many of the columns are empty. We are using Avro-Parquet for data
> > > serialization/deserialization. However, we ran into an out-of-memory
> > > issue when writing the data in the Parquet format.
> > >
> > > Our understanding is that Parquet may keep an internal structure for the
> > > table schema, which may take more memory as the table gets larger. If
> > > that's the case, our question is:
> > >
> > > Is there a limit to the table size that Parquet can support? If yes, how
> > > can we determine that limit?

[jira] [Updated] (PARQUET-416) C++11, cpplint cleanup, package target and header installation

2016-01-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-416:
-
Description: 
I'm planning to work on building out parquet-cpp with columnar data structures 
(see Arrow proposal) for materialized in-memory data and feature complete 
reader/writers so that native-code consumers like Python can finally read and 
write Parquet files at native speeds. It would be great to have all this 
officially a part of Apache Parquet. 

This adds minimal support to be able to install the resulting libparquet.so and 
its various header files to support minimally viable development on downstream 
C++ and Python projects that will need to depend on this. It also builds in 
C++11 mode and passes Google's cpplint.

  was:
I'm planning to work on building out parquet-cpp with columnar data structures 
(see Arrow proposal) for materialized in-memory data and feature complete 
reader/writers so that native-code consumers like Python can finally read and 
write Parquet files at native speeds. It would be great to have all this 
officially a part of Apache Parquet. 

To that end, I have removed the thirdparty libraries and added optional support 
for the open source external C++ toolchain available at 
github.com/cloudera/native-toolchain. 

This also adds minimal support to be able to install the resulting 
libparquet.so and its various header files to support minimally viable 
development on downstream C++ and Python projects that will need to depend on 
this. 


> C++11, cpplint cleanup, package target and header installation
> --
>
> Key: PARQUET-416
> URL: https://issues.apache.org/jira/browse/PARQUET-416
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> I'm planning to work on building out parquet-cpp with columnar data 
> structures (see Arrow proposal) for materialized in-memory data and feature 
> complete reader/writers so that native-code consumers like Python can finally 
> read and write Parquet files at native speeds. It would be great to have all 
> this officially a part of Apache Parquet. 
> This adds minimal support to be able to install the resulting libparquet.so 
> and its various header files to support minimally viable development on 
> downstream C++ and Python projects that will need to depend on this. It also 
> builds in C++11 mode and passes Google's cpplint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)