Many thanks for the suggestion, Andy!

Kr
Marco

On 5 Apr 2016 7:25 pm, "Andy Davidson" <a...@santacruzintegration.com> wrote:
> Hi Marco
>
> You might consider setting up some sort of ELT pipeline. One of your
> stages might be to create a file of all the FTP URLs. You could then write
> a Spark app that just fetches the URLs and stores the data in some sort of
> database or on the file system (HDFS?).
>
> My guess would be to use the map() transform to make the FTP call. If you
> are using Java or Scala, take a look at the Apache Commons FTP client.
>
> I assume that each FTP get is independent. *Maybe someone knows more
> about how to control the amount of concurrency.* I think it will be based
> on the number of partitions, workers, and cores.
>
> Andy
>
> From: Marco Mistroni <mmistr...@gmail.com>
> Date: Tuesday, April 5, 2016 at 9:13 AM
> To: "user @spark" <user@spark.apache.org>
> Subject: Can spark somehow help with this usecase?
>
> Hi
> I'm currently using Spark to process a file containing a million rows
> (EDGAR quarterly filings files). Each row contains some info plus the
> location of a remote file, which I need to retrieve using FTP and then
> process its content.
> I want to do all three operations (process the filings file, fetch the
> remote files, and process them) in one go. I want to avoid doing the first
> step (processing the million-row file) in Spark and the rest (fetching the
> FTP files and processing them) offline.
> Does Spark have anything that can help with the FTP fetch?
>
> Thanks in advance and rgds
> Marco
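For what it's worth, Andy's map()-based idea could be sketched in PySpark roughly as below. This is only an illustration, not the actual EDGAR layout: the pipe-delimited record format with the URL in the last field, the input path, and the partition count of 32 are all assumptions. The `fetch` function uses Python's standard ftplib; on a real cluster it would run inside the map over each partition, and the number of simultaneous FTP sessions is bounded by the number of partitions times the cores per executor, which is why repartitioning first is one way to cap concurrency.

```python
from urllib.parse import urlparse

def extract_ftp_url(line):
    # Hypothetical record layout: pipe-delimited fields, FTP URL last.
    fields = line.split("|")
    url = fields[-1].strip()
    return url if urlparse(url).scheme == "ftp" else None

def fetch(url):
    # Runs on the workers inside rdd.map(fetch); each call opens its own
    # FTP session, downloads the file, and returns its bytes.
    from ftplib import FTP
    parsed = urlparse(url)
    ftp = FTP(parsed.hostname)
    ftp.login()  # anonymous login; EDGAR's FTP server allowed this
    chunks = []
    ftp.retrbinary("RETR " + parsed.path, chunks.append)
    ftp.quit()
    return b"".join(chunks)

# On a cluster (sketch, assuming a SparkContext `sc` and an input file
# "filings.txt"):
#   urls = sc.textFile("filings.txt").map(extract_ftp_url).filter(bool)
#   data = urls.repartition(32).map(fetch)
# With 32 partitions there are at most 32 fetch tasks in flight at once
# (fewer if the cluster has fewer total cores).
```

The key design point is that fetches are independent, so they parallelize trivially with map(); the repartition step exists only to keep the number of concurrent connections polite toward the remote FTP server.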