Pretty easy if you do it efficiently:

gunzip --to-stdout csvfile.gz | bzip2 > csvfile.bz2


Just create a simple bash script to do it and print the timings:

cat convert_file.sh

#!/bin/bash
# fail if any stage of the gunzip | bzip2 pipeline fails, not just the last
set -o pipefail
GZFILE="csvfile.gz"
FILE_NAME=$(basename "$GZFILE" .gz)
BZFILE="$FILE_NAME.bz2"
echo "$(date)  =======  Started compressing file $GZFILE  ======"
gunzip --to-stdout "$GZFILE" | bzip2 > "$BZFILE"
if [ $? -ne 0 ]
then
  echo "$(date)  ======= Could not process $GZFILE, aborting ======"
  exit 1
else
  echo "$(date)  ======= bz2 file $BZFILE created OK ======"
  ls -ltr "$BZFILE"
  exit 0
fi

./convert_file.sh
Thu Mar 25 12:51:55 GMT 2021  =======  Started compressing file csvfile.gz  ======
Thu Mar 25 12:52:00 GMT 2021  ======= bz2 file csvfile.bz2 created OK ======
-rw-r--r-- 1 hduser hadoop 4558525 Mar 25 12:52 csvfile.bz2
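
The point of converting is that bzip2, unlike gzip, is a splittable codec,
so Spark can read the result in parallel. A minimal PySpark sketch to verify
this (the partition count depends on file and block sizes, and the
repartition value of 8 is just an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# bzip2 is splittable, so Spark can divide the file across several tasks
df = spark.read.option("header", "true").csv("csvfile.bz2")
print(df.rdd.getNumPartitions())  # > 1 once the file spans several blocks

# repartition further if you want more parallelism downstream
df = df.repartition(8)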

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 25 Mar 2021 at 12:16, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
wrote:

> Hi Mich,
>
> Yes, you are right. We were getting gz files and this is causing the issue.
> I will be changing it to bzip2 or another splittable format and try running
> it again today.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
>
> On Mar 25, 2021, at 6:51 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>
> Hi Asmath,
>
> Have you actually managed to process this single file? Because gzip is not
> a splittable format, Spark (as has been brought up a few times already)
> will pull the whole of the GZ file into a single partition, and you can
> get an out-of-memory error.
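>
> A quick way to see the symptom, assuming PySpark (the path is illustrative):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # a gzipped CSV cannot be split, so it always comes back as one partition
> df = spark.read.option("header", "true").csv("csvfile.gz")
> print(df.rdd.getNumPartitions())  # 1, regardless of file size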
>
> HTH
>
>
> On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a 10 GB file that should be loaded into a Spark DataFrame. This
>> file is a CSV with a header, and we were using rdd.zipWithIndex to get the
>> column names and convert to Avro accordingly.
>>
>> I am assuming this is taking a long time because only one executor runs
>> and it never achieves parallelism. Is there an easy way to achieve
>> parallelism after filtering out the header?
>>
>> I am also interested in a solution that can remove the header from the
>> file so that I can give my own schema. This way I can split the files.
>>
>> rdd.partitions is always 1 for this, even after repartitioning the
>> DataFrame after zipWithIndex. Any help on this topic, please.
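>>
>> Something like the following PySpark sketch is what I am after (the column
>> names and paths are hypothetical, and writing Avro needs the spark-avro
>> package):
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, StringType
>>
>> spark = SparkSession.builder.getOrCreate()
>>
>> # supply an explicit schema and let Spark drop the header row itself,
>> # instead of using rdd.zipWithIndex to recover the column names
>> schema = StructType([
>>     StructField("col1", StringType(), True),  # hypothetical columns
>>     StructField("col2", StringType(), True),
>> ])
>> df = (spark.read
>>       .schema(schema)
>>       .option("header", "true")  # skip the header line
>>       .csv("myfile.csv"))
>> df.write.format("avro").save("myfile_avro")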
>>
>> Thanks,
>> Asmath
>>
>> Sent from my iPhone
>>
>>
