Hi,
The advantage of Parquet is that it is a columnar storage format: Spark scans 
only the columns you actually select, so the fewer columns you select, the 
less I/O and memory are required. 
Developers do not need to manage the details of loading the data; the Parquet 
reader handles column pruning transparently.
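For example, selecting columns right after the read lets Spark push the 
projection down to the Parquet scan. A minimal Scala sketch (it assumes the 
SparkSession is named spark, as in spark-shell; the path and column names 
are made up for illustration):

// Read a wide Parquet file but select only two columns; the projection
// is pushed down so only those column chunks are read from disk.
val df = spark.read.parquet("/data/events.parquet")
  .select("user_id", "event_time")

// The FileScan node in the physical plan shows a ReadSchema containing
// only the selected columns, confirming the others are never scanned.
df.explain()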
At 2020-04-16 11:00:32, "Yeikel" <em...@yeikel.com> wrote:
>I have a parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>
>The smaller cluster cannot load the entire parquet file in memory, but it
>can load the derived tables.
>
>If I am reading a parquet file and I only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it
>different? Do I need more or less computing power depending on the number
>of columns I select, or does it depend more on the raw source itself and
>the number of columns it contains?
>
>One suggestion I received from a colleague was to derive the tables using
>the larger cluster and just import them in the smaller cluster, but I was
>wondering if that's really necessary considering that after the import, I
>won't be using the dumps anymore.
>
>I hope my question makes sense. 
>
>Thanks for your help!
>