Re: Text from pdf spark

2018-09-28 Thread Joel D
Yes, I can access the file using the CLI.
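
One possible reading of why the CLI check succeeds while the job still fails: `hdfs dfs -ls` talks to HDFS, whereas `java.io.File` only ever looks at the local file system of the machine running the code. The sketch below uses only the standard library and the path from the original post. It shows what `drop(5)` leaves behind and how parsing the key as a URI would split it; the object name `PathDemo` is made up for illustration.

```scala
import java.io.File
import java.net.URI

object PathDemo {
  // The key that binaryFiles pairs with the stream, as given in the original post.
  val key = "hdfs://nnAlias:8020/tmp/sample.pdf"

  // drop(5) removes only the five characters "hdfs:", so the NameNode
  // authority stays glued to the front of the path.
  val stripped: String = key.drop(5)

  // Parsing the key as a URI separates scheme, authority, and path.
  val uri: URI = new URI(key)

  def main(args: Array[String]): Unit = {
    println(stripped)         // //nnAlias:8020/tmp/sample.pdf
    println(uri.getScheme)    // hdfs
    println(uri.getAuthority) // nnAlias:8020
    println(uri.getPath)      // /tmp/sample.pdf

    // java.io.File treats the string as a local path. On Unix-like systems
    // the doubled leading slash is collapsed, which would match the path
    // shown in the FileNotFoundException.
    println(new File(stripped).getPath)
  }
}
```

Even with the path cleaned up to `/tmp/sample.pdf`, `new File(...)` would still look on the local disk of the executor, not in HDFS.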

On Fri, Sep 28, 2018 at 1:24 PM kathleen li  wrote:

> The error message is “file not found”.
> Are you able to access the file from the command line, as the user who
> submitted the job?
> hdfs dfs -ls /tmp/sample.pdf


Re: Text from pdf spark

2018-09-28 Thread kathleen li
The error message is “file not found”.
Are you able to access the file from the command line, as the user who
submitted the job?
hdfs dfs -ls /tmp/sample.pdf



Text from pdf spark

2018-09-28 Thread Joel D
I'm trying to extract text from PDF files in HDFS using PDFBox.

However, it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"

What am I missing? Should I be working with the PortableDataStream instead of
the String part of each element in:

val files: RDD[(String, PortableDataStream)]?

def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.

  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }
  document.close()
}


// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
files.foreach(println)
files.foreach(f => println(f._1))
files.foreach(fileStream => pdfRead(fileStream, sparkSession))


Thanks.
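
On the PortableDataStream question above, a minimal sketch of what reading from the stream might look like instead of building a `java.io.File` from the path string. It assumes PDFBox 2.x, where `PDDocument.load` accepts an `InputStream` and `PDFTextStripper` lives in `org.apache.pdfbox.text`; other PDFBox versions differ. The object name `PdfTextSketch`, the app name, and the choice to return the text to the driver are illustrative and not taken from the thread.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfTextSketch {
  // Extracts the text of one (path, stream) pair produced by binaryFiles.
  // Returns an empty string for encrypted documents, mirroring the
  // isEncrypted check in the original post.
  def pdfRead(entry: (String, PortableDataStream)): String = {
    val in = entry._2.open() // opens the file through the Hadoop file system
    try {
      val document = PDDocument.load(in) // assumption: PDFBox 2.x API
      try {
        if (!document.isEncrypted()) new PDFTextStripper().getText(document)
        else ""
      } finally {
        document.close()
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extraction runs on the executors; collect() brings the text back so
    // that println prints on the driver.
    files.map(pdfRead).collect().foreach(text => println("Text:" + text))

    spark.stop()
  }
}
```

The SparkSession parameter is dropped from `pdfRead` here because the function runs on the executors, where the session is not usable, and the original body never referenced it.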