Hi Team,

I have asked this question on Stack Overflow
<https://stackoverflow.com/questions/61386719/load-a-master-data-file-to-spark-ecosystem>
but didn't really get any convincing answers. Can somebody help me
solve this issue?

Below is my problem

While building a log processing system, I came across a scenario where I
need to look up a corresponding value from a tree file (like a DB) for each
and every log line. What is the best approach to load a very large external
file like this into the Spark ecosystem? The tree file is about 2 GB.

Here is my scenario

   1. I have a file containing a huge number of log lines.
   2. Each log line needs to be split by a delimiter into 70 fields.
   3. For one of those 70 fields, I need to look up the corresponding data
   in the tree file.
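
For the loading question above, would a broadcast variable be the right
approach? Here is a rough sketch of what I mean (parse_tree() is just a
placeholder for however the tree file gets parsed into a lookup structure,
and tree_lookup_value is my existing lookup helper); I am not sure whether
broadcasting ~2 GB of parsed data from the driver is advisable:

from pyspark.sql import SparkSession, Row

def run_with_broadcast():
    spark = SparkSession.builder.getOrCreate()
    # parse the 2 GB tree file into a lookup structure on the driver
    tree = parse_tree('somefile')
    # ship the parsed structure to every executor once
    tree_bc = spark.sparkContext.broadcast(tree)
    lines = spark.sparkContext.textFile("log_file")
    rows = lines.map(
        lambda l: Row(host=tree_lookup_value(tree_bc.value, l.split(" ")[0])))
    spark.createDataFrame(rows).show()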

I am using the Apache Spark Python API (PySpark) and running on a 3-node cluster.

Below is the code I have written, but it is really slow.

from pyspark import SparkFiles
from pyspark.sql import Row
# spark is an existing SparkSession; tree_lookup_value is my own lookup helper

def process_logline(line, tree):
    row_dict = {}
    line_list = line.split(" ")
    row_dict["host"] = tree_lookup_value(tree, line_list[0])
    new_row = Row(**row_dict)
    return new_row

def run_job(vals):
    spark.sparkContext.addFile('somefile')
    # tree_val is a file object opened on the driver and captured in the
    # closure passed to map() below
    tree_val = open(SparkFiles.get('somefile'))
    lines = spark.sparkContext.textFile("log_file")
    converted_lines_rdd = lines.map(lambda l: process_logline(l, tree_val))
    log_line_rdd = spark.createDataFrame(converted_lines_rdd)
    log_line_rdd.show()

Basically, I need a way to load the file into worker memory once and keep
using it for the entire lifetime of the job, using the Python API.
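
Is loading the file once per partition with mapPartitions the kind of thing
I should be doing instead? A rough sketch of what I have in mind is below
(build_tree() is again just a placeholder for parsing the tree file):

from pyspark import SparkFiles
from pyspark.sql import SparkSession, Row

def process_partition(lines):
    # build the tree once per partition instead of touching the file per line
    tree = build_tree(SparkFiles.get('somefile'))
    for line in lines:
        fields = line.split(" ")
        yield Row(host=tree_lookup_value(tree, fields[0]))

def run_with_mappartitions():
    spark = SparkSession.builder.getOrCreate()
    # distribute the tree file to every worker node
    spark.sparkContext.addFile('somefile')
    lines = spark.sparkContext.textFile("log_file")
    rows = lines.mapPartitions(process_partition)
    spark.createDataFrame(rows).show()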

Thanks in advance
Arjun
