The error message is “file not found”.
Are you able to access the file from the command line, as the same user that 
submitted the job?
hdfs dfs -ls /tmp/sample.pdf
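
If the file does show up there, another thing worth checking (this is only my reading of the quoted code, not something confirmed in this thread): java.io.File resolves paths on the local filesystem of the executor, and fileNameFromRDD._1.drop(5) only strips the "hdfs:" prefix from the URI, which would explain the local-looking path in the exception. A minimal sketch that reads through the PortableDataStream instead, assuming PDFBox 2.x (where PDDocument.load accepts an InputStream); the object name PdfTextFromHdfs is made up for illustration:

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object PdfTextFromHdfs {

  // Extracts text from one (path, stream) pair produced by binaryFiles.
  // Returns None for encrypted documents, mirroring the check in the quoted code.
  def pdfRead(entry: (String, PortableDataStream)): Option[String] = {
    // open() returns a stream backed by the Hadoop FileSystem, not the local disk.
    val in = entry._2.open()
    try {
      // Assumption: PDFBox 2.x, where PDDocument.load(InputStream) exists.
      val document = PDDocument.load(in)
      try {
        if (document.isEncrypted) None
        else Some(new PDFTextStripper().getText(document))
      } finally {
        document.close()
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PdfTextFromHdfs").getOrCreate()

    // Same URI as in the quoted message.
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extract on the executors, then bring the (small) result back to the driver.
    files
      .map(f => (f._1, pdfRead(f)))
      .collect()
      .foreach { case (path, text) =>
        println(s"$path: ${text.getOrElse("<encrypted, skipped>")}")
      }

    spark.stop()
  }
}
```

Collecting the text to the driver is only meant for a single-file test like this one; with many files you would more likely write the results out from the executors.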

> On Sep 28, 2018, at 12:10 PM, Joel D <games2013....@gmail.com> wrote:
> 
> I'm trying to extract text from PDF files in HDFS using PDFBox. 
> However, it throws an error:
> 
> "Exception in thread "main" org.apache.spark.SparkException: ...
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf 
> (No such file or directory)"
> 
> 
> 
> What am I missing? Should I be working with PortableDataStream instead of the 
> string part of:
> val files: RDD[(String, PortableDataStream)]?
> def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
>   val file: File = new File(fileNameFromRDD._1.drop(5))
>   val document = PDDocument.load(file) // It throws an error here.
> 
>   if (!document.isEncrypted()) {
>     val stripper = new PDFTextStripper()
>     val text = stripper.getText(document)
>     println("Text:" + text)
>   }
>   document.close()
> }
> 
> // This is where I call the above PDF-to-text converter method.
> val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
> files.foreach(println)
> 
> files.foreach(f => println(f._1))
> 
> files.foreach(fileStream => pdfRead(fileStream, sparkSession))
> 
> Thanks.
