Re: Text from pdf spark
Yes, I can access the file using the CLI.

On Fri, Sep 28, 2018 at 1:24 PM kathleen li wrote:
> The error message is “file not found”
> Are you able to use the following command line to access the file with the
> user you submitted the job?
> hdfs dfs -ls /tmp/sample.pdf
Re: Text from pdf spark
The error message is “file not found”.

Are you able to use the following command line to access the file as the user who submitted the job?

hdfs dfs -ls /tmp/sample.pdf

On Sep 28, 2018, at 12:10 PM, Joel D wrote:
> I'm trying to extract text from pdf files in hdfs using pdfBox.
> However it throws an error:
>
> "Exception in thread "main" org.apache.spark.SparkException: ...
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
> (No such file or directory)"
>
> What am I missing?
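
A note on the path in the exception: it is /nnAlias:8020/tmp/sample.pdf, not the /tmp/sample.pdf that the hdfs command lists. Below is a minimal sketch of where that string could come from, using only the JDK and assuming a Unix-like machine. The object name PathCheck is made up for illustration; the key and the drop(5) call are taken from the posted code.

```scala
import java.io.File
import java.net.URI

object PathCheck {
  // The key that binaryFiles is given (and reports) in the original post.
  val key: String = "hdfs://nnAlias:8020/tmp/sample.pdf"

  // What the posted code hands to java.io.File: the key minus its first five characters ("hdfs:").
  val dropped: String = key.drop(5)

  // java.io.File only knows the local filesystem. On a Unix-like system it
  // collapses the doubled slash and treats host:port as a directory name.
  val localPath: String = new File(dropped).getPath

  // Parsing the key as a URI keeps scheme, host, port and path apart.
  val uri: URI = new URI(key)

  def main(args: Array[String]): Unit = {
    println(dropped)      // //nnAlias:8020/tmp/sample.pdf
    println(localPath)    // /nnAlias:8020/tmp/sample.pdf on a Unix-like system
    println(uri.getPath)  // /tmp/sample.pdf
  }
}
```

If that reading is right, the lookup goes to the local filesystem of whichever machine runs the task, so it can fail even when hdfs dfs -ls succeeds.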
Text from pdf spark
I'm trying to extract text from pdf files in hdfs using pdfBox. However it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"

What am I missing? Should I be working with PortableDataStream instead of the string part of: val files: RDD[(String, PortableDataStream)]?

    def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
      val file: File = new File(fileNameFromRDD._1.drop(5))
      val document = PDDocument.load(file); //It throws an error here.

      if (!document.isEncrypted()) {
        val stripper = new PDFTextStripper()
        val text = stripper.getText(document)
        println("Text:" + text)
      }
      document.close()
    }

    //This is where I call the above pdf to text converter method.
    val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
    files.foreach(println)
    files.foreach(f => println(f._1))
    files.foreach(fileStream => pdfRead(fileStream, sparkSession))

Thanks.
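
On the PortableDataStream question, one possible direction is to read the bytes through the stream that binaryFiles already provides, rather than building a java.io.File from the key. What follows is a rough sketch, assuming PDFBox 2.x (where PDDocument.load accepts an InputStream) and Spark on the classpath. PdfTextSketch and the app name are hypothetical, and this has not been run against the cluster in this thread.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object PdfTextSketch {

  // Extracts text from one PDF using the stream Spark opens against HDFS.
  // Returns None for encrypted documents, mirroring the check in the posted code.
  def pdfRead(stream: PortableDataStream): Option[String] = {
    val in = stream.open() // DataInputStream over the HDFS file
    try {
      val document = PDDocument.load(in) // assumption: PDFBox 2.x API
      try {
        if (!document.isEncrypted) Some(new PDFTextStripper().getText(document))
        else None
      } finally {
        document.close()
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumes the master is supplied externally, e.g. by spark-submit.
    val spark = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()

    // Same path as in the original post.
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extract on the executors, then bring the text back to the driver to print it.
    val texts = files.mapValues(pdfRead).collect()
    texts.foreach { case (path, text) =>
      println(s"$path: ${text.getOrElse("<encrypted, skipped>")}")
    }

    spark.stop()
  }
}
```

The sketch collects the results before printing because println inside an RDD foreach runs on the executors, so its output goes to the executor logs rather than the driver console. Collecting is only reasonable for a small number of files.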