Hi

My environment is as follows:

5 nodes, each of which generates a big CSV file every 5 minutes. I need Spark
Streaming to analyze these 5 files every five minutes and generate a
report.

I am planning to do it this way:

1. Put those 5 files into an HDFS directory called /data
2. Merge them into one big file in that directory
3. Use the Spark Streaming constructor textFileStream('/data') to create my
input DStream

The problem with this approach is that I do not know how to merge the 5 files
in HDFS. It seems very difficult to do in Python.
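To show what I mean by "merge", here is a sketch of the concatenation step against the local filesystem (all paths here are throwaway placeholders). Doing the same thing inside HDFS from Python is the part I am stuck on; as far as I can tell, hdfs dfs -getmerge only merges down to the local filesystem, so I would still have to copy the result back up.

```python
import os
import shutil
import tempfile

# Concatenate several CSV part files into one big file.
# This is plain local-filesystem I/O, not HDFS -- it only
# illustrates the merge I want to perform inside HDFS.
def merge_files(part_paths, merged_path):
    with open(merged_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)

# Example with 5 throwaway files standing in for the 5 node outputs
tmp = tempfile.mkdtemp()
parts = []
for i in range(5):
    p = os.path.join(tmp, "node%d.csv" % i)
    with open(p, "w") as f:
        f.write("node%d,value\n" % i)
    parts.append(p)

merged = os.path.join(tmp, "merged.csv")
merge_files(parts, merged)
```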

So my questions are:

1. Can you tell me how to merge files in HDFS using Python?
2. Do you know some other way to feed those files into Spark?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
