Hello,

We are running into issues while trying to process fixed-length files using Spark.
The approach we took is as follows:

1. Read the .bz2 file into a Dataset from HDFS using the spark.read().textFile() API and create a temporary view:

       Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
       rawDataset.createOrReplaceTempView(tempView);

2. Run a SQL query on the view to slice and dice the data the way we need it (using SUBSTRING):

       SELECT TRIM(SUBSTRING(value, 1, 16))           AS record1,
              TRIM(SUBSTRING(value, 17, 8))           AS record2,
              TRIM(SUBSTRING(value, 25, 5))           AS record3,
              TRIM(SUBSTRING(value, 30, 16))          AS record4,
              CAST(SUBSTRING(value, 46, 8) AS BIGINT) AS record5,
              CAST(SUBSTRING(value, 54, 6) AS BIGINT) AS record6,
              CAST(SUBSTRING(value, 60, 3) AS BIGINT) AS record7,
              CAST(SUBSTRING(value, 63, 6) AS BIGINT) AS record8,
              TRIM(SUBSTRING(value, 69, 20))          AS record9,
              TRIM(SUBSTRING(value, 89, 40))          AS record10,
              TRIM(SUBSTRING(value, 129, 32))         AS record11,
              TRIM(SUBSTRING(value, 161, 19))         AS record12,
              TRIM(SUBSTRING(value, 180, 1))          AS record13,
              TRIM(SUBSTRING(value, 181, 9))          AS record14,
              TRIM(SUBSTRING(value, 190, 3))          AS record15,
              CAST(SUBSTRING(value, 193, 8) AS BIGINT) AS record16,
              CAST(SUBSTRING(value, 201, 8) AS BIGINT) AS record17
       FROM tempView

3. Write the output of the SQL query to a Parquet file:

       loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);

Problem: step 2 takes longer as the line length grows. If each line in the file is 1000 characters, processing 20 million lines takes only about 4 minutes; if we increase the line length to 2000 characters, the same 20 million lines take about 20 minutes.

Is there a better way in Spark to parse fixed-length lines?

Note: we are on Spark 2.2.0, using Spark with Java.
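For reference, in case it helps, here is a condensed, self-contained version of the job we run. The class name, app name, and HDFS paths below are placeholders, and only a few of the 17 columns are repeated to keep it short:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class FixedLengthJob {
        public static void main(String[] args) {
            // Placeholder paths; the real HDFS locations are passed in production.
            String filePath = "hdfs:///data/in/fixed_length.bz2";
            String outputDirectory = "hdfs:///data/out/parquet";

            SparkSession sparkSession = SparkSession.builder()
                    .appName("FixedLengthJob")
                    .getOrCreate();

            // Step 1: read the raw lines and register them as a temp view.
            Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
            rawDataset.createOrReplaceTempView("tempView");

            // Step 2: slice each line into typed columns
            // (only the first few of the 17 columns are repeated here).
            Dataset<Row> loadDataset = sparkSession.sql(
                    "SELECT TRIM(SUBSTRING(value, 1, 16)) AS record1, "
                  + "TRIM(SUBSTRING(value, 17, 8)) AS record2, "
                  + "CAST(SUBSTRING(value, 46, 8) AS BIGINT) AS record5 "
                  + "FROM tempView");

            // Step 3: write the sliced columns out as Parquet.
            loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);

            sparkSession.stop();
        }
    }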