tesmai4 wrote
> I am converting my Java based NLP parser to execute it on my Spark
> cluster. I know that Spark can read multiple text files from a directory
> and convert into RDDs for further processing. My input data is not only in
> text files, but in a multitude of different file formats.
>
> My question is: How can I efficiently read the input files
> (PDF/Text/Word/HTML) in my Java based Spark program for processing these
> files in Spark cluster.
I suggest Flume <https://flume.apache.org/>. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Kafka <https://kafka.apache.org/> is also worth a look; it is a distributed streaming platform. It is also popular to use Flume and Kafka together (Flafka <http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/>).

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-PDF-text-word-file-efficiently-with-Spark-tp28699p28705.html