The error message is "file not found". Are you able to access the file with the following command, run as the user who submitted the job?

    hdfs dfs -ls /tmp/sample.pdf
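It may also help to see where the path in the exception comes from. The sketch below uses only the JDK and the URI string from the original post; `PathCheck` is a hypothetical name used for illustration. It shows that `drop(5)` strips only the `hdfs:` prefix, so the namenode alias and port stay in the path, and that `java.io.File` then looks for that path on the local filesystem rather than in HDFS.

```scala
import java.io.File
import java.net.URI

object PathCheck {
  // Path string as it appears in the first element of the binaryFiles tuple.
  val hdfsUri = "hdfs://nnAlias:8020/tmp/sample.pdf"

  // What the posted code builds: drop(5) removes only the five characters "hdfs:".
  val dropped: String = hdfsUri.drop(5)

  // Path component of the URI, which is the form `hdfs dfs -ls` expects.
  val hdfsPath: String = new URI(hdfsUri).getPath

  def main(args: Array[String]): Unit = {
    println(dropped)   // //nnAlias:8020/tmp/sample.pdf
    println(hdfsPath)  // /tmp/sample.pdf
    // java.io.File resolves against the local filesystem of the JVM, never HDFS.
    println(new File(dropped).exists())
  }
}
```

Even with the correct `/tmp/sample.pdf` path, `java.io.File` would still only look on the local disk of whichever node runs the task.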
On Sep 28, 2018, at 12:10 PM, Joel D <games2013....@gmail.com> wrote:
>
> I'm trying to extract text from pdf files in hdfs using pdfBox.
> However it throws an error:
>
> "Exception in thread "main" org.apache.spark.SparkException: ...
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
> (No such file or directory)"
>
> What am I missing? Should I be working with PortableDataStream instead of the
> string part of:
> val files: RDD[(String, PortableDataStream)]?
>
> def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
>   val file: File = new File(fileNameFromRDD._1.drop(5))
>   val document = PDDocument.load(file); //It throws an error here.
>
>   if (!document.isEncrypted()) {
>     val stripper = new PDFTextStripper()
>     val text = stripper.getText(document)
>     println("Text:" + text)
>   }
>   document.close()
> }
>
> //This is where I call the above pdf to text converter method.
> val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
> files.foreach(println)
>
> files.foreach(f => println(f._1))
>
> files.foreach(fileStream => pdfRead(fileStream, sparkSession))
>
> Thanks.
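On the PortableDataStream question: a minimal sketch of reading the PDF through the stream Spark provides instead of through `java.io.File`. It assumes PDFBox 2.x, where `PDDocument.load` accepts an `InputStream` and `PDFTextStripper` lives in `org.apache.pdfbox.text`; `PdfText` and `extractText` are hypothetical names, and the HDFS URI is the one from the post. This has not been run against the poster's cluster.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfText {
  // Hypothetical helper: takes one (path, stream) pair produced by binaryFiles
  // and returns the extracted text, or None if the document is encrypted.
  def extractText(entry: (String, PortableDataStream)): Option[String] = {
    val in = entry._2.open() // DataInputStream that reads the file from HDFS
    try {
      val document = PDDocument.load(in) // assumes the PDFBox 2.x API
      try {
        if (!document.isEncrypted) Some(new PDFTextStripper().getText(document))
        else None
      } finally document.close()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text").getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
    // The closure captures no SparkSession, so nothing non-serializable is shipped.
    files.foreach { entry =>
      extractText(entry).foreach(text => println("Text:" + text))
    }
    spark.stop()
  }
}
```

Because `foreach` runs on the executors, the `println` output lands in the executor logs rather than on the driver console; collecting the result of `files.map(PdfText.extractText)` would be one way to bring the text back to the driver for small inputs.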