Working with a text file that is compressed by bz2 and then by zip in PySpark

Mich Talebzadeh Mon, 04 Mar 2024 10:00:18 -0800

I have downloaded Amazon reviews for sentiment analysis from here. The file
is not particularly large (just over 500MB) but comes in the following
format:


test.ft.txt.bz2.zip

So it is a text file that is compressed with bz2 and then zipped. I would like
to do all these operations in PySpark. However, PySpark cannot read a file that
is both .bz2 and .zip compressed in a single step.

The way I do it is to place the downloaded file in a local directory and then
do some operations that are simple but messy. I unzip the file using the
zipfile package. This works with a bash-style filename, as opposed to the
URI-style filename "file:///.." that Spark uses. This necessitates using two
path styles: an OS-style path for zipfile, and a "file:///" URI to read the
bz2 file directly into a DataFrame in PySpark:

import os
import zipfile

from pyspark.sql.functions import expr

# Spark-style URI paths (used by spark.read)
data_path = "file:///d4T/hduser/sentiments/"
input_file_path = os.path.join(data_path, "test.ft.txt.bz2")
output_file_path = os.path.join(data_path, "review_text_file")

# OS-style paths (used by zipfile and os)
dir_name = "/d4T/hduser/sentiments/"
zipped_file = os.path.join(dir_name, "test.ft.txt.bz2.zip")
bz2_file = os.path.join(dir_name, "test.ft.txt.bz2")

try:
    # Unzip the file
    with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(bz2_file))

    # Now bz2_file should contain the path to the unzipped file
    print(f"Unzipped file: {bz2_file}")
except Exception as e:
    print(f"Error during unzipping: {str(e)}")

# Load the bz2 file into a DataFrame (assumes an existing SparkSession named spark)
df = spark.read.text(input_file_path)
# Remove the '__label__1' and '__label__2' prefixes
df = df.withColumn("review_text", expr("regexp_replace(value, '__label__[12] ', '')"))
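
As an aside on the two path styles, both can be derived from a single local
path rather than maintained separately. The following is only a sketch,
assuming a POSIX filesystem; local_and_uri is a hypothetical helper name, not
part of PySpark or of the code above.

```python
from pathlib import Path


def local_and_uri(dir_name, file_name):
    """Return (OS-style path, file:// URI) for the same local file.

    Hypothetical helper for illustration; assumes dir_name is absolute.
    """
    local_path = Path(dir_name) / file_name
    # as_uri() only works on absolute paths and yields a file:// URI
    return str(local_path), local_path.as_uri()


# OS-style path for zipfile/os, URI-style path for spark.read
os_path, spark_path = local_and_uri("/d4T/hduser/sentiments", "test.ft.txt.bz2")
print(os_path)
print(spark_path)
```

With this, the OS-style value would go to zipfile and os.remove, and the URI
value to spark.read.text.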

Then the rest is just Spark ML.

Once finished, I remove the bz2 file to clean up:

if os.path.exists(bz2_file):  # Check if bz2 file exists
    try:
        os.remove(bz2_file)
        print(f"Successfully deleted {bz2_file}")
    except OSError as e:
        print(f"Error deleting {bz2_file}: {e}")
else:
    print(f"bz2 file {bz2_file} could not be found")


My question is: can these operations be done more efficiently in PySpark
itself, ideally with one DataFrame operation reading the original file
(.bz2.zip)?
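
For illustration, one possible direction that avoids writing the intermediate
.bz2 file to disk is to stream it out of the zip archive with the standard
library. This is a minimal sketch, not a Spark-native answer: the function
name read_bz2_inside_zip and the assumption that the archive holds a single
bz2 member are hypothetical, and the decompression happens in one Python
process rather than in parallel.

```python
import bz2
import zipfile


def read_bz2_inside_zip(zip_path, member=None, encoding="utf-8"):
    """Yield text lines from a .bz2 member stored inside a .zip archive.

    Hypothetical helper; nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        # Assumption: if no member is named, the archive holds one file
        name = member or archive.namelist()[0]
        with archive.open(name) as zipped_stream:
            # bz2.open accepts a file object and decompresses as it is read
            with bz2.open(zipped_stream, mode="rt", encoding=encoding) as text:
                for line in text:
                    yield line.rstrip("\n")
```

The resulting lines could then be handed to Spark, for example with
spark.createDataFrame([(line,) for line in lines], ["value"]), at the cost of
materialising the data on the driver first.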

Thanks


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Wernher von Braun)".
