Hello,

We are running into issues while trying to process fixed-length files with
Spark.

The approach we took is as follows:

1. Read the .bz2 file from HDFS into a Dataset using the
spark.read().textFile() API, and create a temporary view.

     // Each input line becomes one row in a single string column named "value".
     Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
     rawDataset.createOrReplaceTempView(tempView);

2. Run a SQL query on the view to slice and dice the data the way we need
it (using SUBSTRING).

     (SELECT
          TRIM(SUBSTRING(value, 1, 16))            AS record1,
          TRIM(SUBSTRING(value, 17, 8))            AS record2,
          TRIM(SUBSTRING(value, 25, 5))            AS record3,
          TRIM(SUBSTRING(value, 30, 16))           AS record4,
          CAST(SUBSTRING(value, 46, 8) AS BIGINT)  AS record5,
          CAST(SUBSTRING(value, 54, 6) AS BIGINT)  AS record6,
          CAST(SUBSTRING(value, 60, 3) AS BIGINT)  AS record7,
          CAST(SUBSTRING(value, 63, 6) AS BIGINT)  AS record8,
          TRIM(SUBSTRING(value, 69, 20))           AS record9,
          TRIM(SUBSTRING(value, 89, 40))           AS record10,
          TRIM(SUBSTRING(value, 129, 32))          AS record11,
          TRIM(SUBSTRING(value, 161, 19))          AS record12,
          TRIM(SUBSTRING(value, 180, 1))           AS record13,
          TRIM(SUBSTRING(value, 181, 9))           AS record14,
          TRIM(SUBSTRING(value, 190, 3))           AS record15,
          CAST(SUBSTRING(value, 193, 8) AS BIGINT) AS record16,
          CAST(SUBSTRING(value, 201, 8) AS BIGINT) AS record17
      FROM tempView)
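
For reference, this is roughly how we run the query from Java
(fixedWidthQuery is a String holding the full SELECT above); loadDataset is
what step 3 writes out:

     // Run the fixed-width SELECT against the temp view registered in step 1.
     String fixedWidthQuery = "SELECT ... FROM tempView"; // full query as above
     Dataset<Row> loadDataset = sparkSession.sql(fixedWidthQuery);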

3. Write the output of the SQL query to a Parquet file.
     loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);

Problem:

  Step #2 takes longer as the line length grows. If each line in the file
is 1000 characters, it takes about 4 minutes to process 20 million lines;
if we increase the line length to 2000 characters, processing the same 20
million lines takes 20 minutes.


Is there a better way in Spark to parse fixed-length lines?
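
For example, one alternative we have been considering (a sketch only, not
benchmarked yet) is to do the slicing with the Column API instead of the
temp view + SQL:

     import static org.apache.spark.sql.functions.*;

     Dataset<Row> loadDataset = rawDataset.select(
         trim(substring(col("value"), 1, 16)).alias("record1"),
         substring(col("value"), 46, 8).cast("bigint").alias("record5")
         // ... record2 through record17 follow the same pattern as the SQL
     );

Would that, or a plain map over the raw Dataset using String.substring, be
expected to perform differently from the SQL path?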


Note: The Spark version we use is 2.2.0, and we are using Spark with Java.



