Hi,

The advantage of Parquet is that it is a columnar storage format, so only the columns you actually select are scanned. The fewer columns you select, the less I/O and memory are required. You do not need to manage the details of loading the data yourself; column pruning is built into the reader and transparent to the user.
At 2020-04-16 11:00:32, "Yeikel" <em...@yeikel.com> wrote:
>I have a parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>
>The smaller cluster cannot load the entire parquet file in memory, but it
>can load the derived tables.
>
>If I am reading a parquet file and I only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it
>different? Do I need more or less computing power depending on the number
>of columns I select, or does it depend more on the raw source itself and
>the number of columns it contains?
>
>One suggestion I received from a colleague was to derive the tables using
>the larger cluster and just import them into the smaller cluster, but I was
>wondering whether that's really necessary, considering that after the
>import I won't be using the dumps anymore.
>
>I hope my question makes sense.
>
>Thanks for your help!