Hi Marco

You might consider setting up some sort of ELT pipeline. One of your stages
might be to create a file of all the FTP URLs. You could then write a Spark
app that just fetches the URLs and stores the data in some sort of database
or on the filesystem (HDFS?)
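
A minimal sketch of that fetch-and-store stage in Scala (the paths, and the
fetchViaFtp helper sketched further down, are my own placeholders, not
anything Spark ships with):

    // Stage 2: read the URL list produced by stage 1, fetch each file,
    // and persist the results.
    val urls = sc.textFile("hdfs:///edgar/urls.txt")        // one FTP URL per line
    val fetched = urls.map(url => (url, fetchViaFtp(url)))  // (url, file contents) pairs
    fetched.saveAsObjectFile("hdfs:///edgar/filings")       // or write to a database instead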

My guess would be to maybe use the map() transform to make the FTP call. If
you are using Java or Scala, take a look at the Apache Commons Net FTPClient.
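
Something along these lines, as a rough sketch only: the anonymous login and
the ftp://host/path URL shape are assumptions about the EDGAR server, and
fetchViaFtp is a made-up helper name, not part of Commons Net:

    import java.net.URI
    import org.apache.commons.net.ftp.{FTP, FTPClient}

    // Fetch one remote file over FTP and return its contents as a string.
    def fetchViaFtp(url: String): String = {
      val uri = new URI(url)
      val client = new FTPClient()
      try {
        client.connect(uri.getHost)
        client.login("anonymous", "anonymous@example.com")  // assumes anonymous FTP
        client.enterLocalPassiveMode()                      // friendlier to firewalls/NAT
        client.setFileType(FTP.BINARY_FILE_TYPE)
        val out = new java.io.ByteArrayOutputStream()
        client.retrieveFile(uri.getPath, out)
        out.toString("UTF-8")
      } finally {
        if (client.isConnected) client.disconnect()
      }
    }

If each task ends up handling many URLs, mapPartitions() would let you open
one FTPClient per partition and reuse it, instead of one connection per file.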

I assume that each FTP get is independent. Maybe someone knows more about
how to control the amount of concurrency. I think it will be based on the
number of partitions, workers, and cores?
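
If that is right, repartition() would be one way to cap it, since each task
holds at most one FTP connection at a time. The 32 below is an arbitrary
example value; tune it to what the FTP server tolerates:

    // Roughly: concurrent FTP connections <= min(num partitions, total executor cores).
    val urls = sc.textFile("hdfs:///edgar/urls.txt").repartition(32)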

Andy

From:  Marco Mistroni <mmistr...@gmail.com>
Date:  Tuesday, April 5, 2016 at 9:13 AM
To:  "user @spark" <user@spark.apache.org>
Subject:  Can spark somehow help with this usecase?

> 
> Hi 
>  I'm currently using Spark to process a file containing a million
> rows (EDGAR quarterly filings).
> Each row contains some info plus the location of a remote file which I need to
> retrieve using FTP and then process its content.
> I want to do all 3 operations (process the filings file, fetch the remote files and
> process them) in one go.
> I want to avoid doing the first step (processing the million-row file) in
> Spark and the rest (fetching via FTP and processing the files) offline.
> Does Spark have anything that can help with the FTP fetch?
> 
> Thanks in advance and rgds
>  Marco

