Many thanks for the suggestion, Andy!

Kr
Marco

On 5 Apr 2016 7:25 pm, "Andy Davidson" <a...@santacruzintegration.com> wrote:
> Hi Marco
>
> You might consider setting up some sort of ELT pipeline. One of your
> stages might be to create a file of all the FTP URLs. You could then write
> a Spark app that just fetches the URLs and stores the data in some sort of
> database or on the file system (HDFS?).
>
> My guess would be to use the map() transform to make the FTP call. If you
> are using Java or Scala, take a look at the Apache Commons FTP client.
>
> I assume that each FTP get is independent. *Maybe someone knows more
> about how to control the amount of concurrency.* I think it will be based
> on the number of partitions, workers, and cores.
>
> Andy
>
> From: Marco Mistroni <mmistr...@gmail.com>
> Date: Tuesday, April 5, 2016 at 9:13 AM
> To: "user @spark" <user@spark.apache.org>
> Subject: Can spark somehow help with this usecase?
>
> Hi
> I'm currently using Spark to process a file containing a million rows
> (EDGAR quarterly filings files). Each row contains some info plus the
> location of a remote file, which I need to retrieve using FTP and then
> process its content.
> I want to do all three operations (process the filings file, fetch the
> remote files, and process them) in one go. I want to avoid doing the first
> step (processing the million-row file) in Spark and the rest (fetching the
> FTP files and processing them) offline.
> Does Spark have anything that can help with the FTP fetch?
>
> Thanks in advance and rgds
> Marco
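For what it's worth, Andy's map()-based idea could be sketched in PySpark roughly as below. This is only an illustration, not the actual EDGAR layout: the pipe-delimited record format with the URL in the last field, the input path, and the partition count of 32 are all assumptions. The `fetch` function uses Python's standard ftplib; on a real cluster it would run inside the map over each partition, and the number of simultaneous FTP sessions is bounded by the number of partitions times the cores per executor, which is why repartitioning first is one way to cap concurrency.

```python
from urllib.parse import urlparse

def extract_ftp_url(line):
    # Hypothetical record layout: pipe-delimited fields, FTP URL last.
    fields = line.split("|")
    url = fields[-1].strip()
    return url if urlparse(url).scheme == "ftp" else None

def fetch(url):
    # Runs on the workers inside rdd.map(fetch); each call opens its own
    # FTP session, downloads the file, and returns its bytes.
    from ftplib import FTP
    parsed = urlparse(url)
    ftp = FTP(parsed.hostname)
    ftp.login()  # anonymous login; EDGAR's FTP server allowed this
    chunks = []
    ftp.retrbinary("RETR " + parsed.path, chunks.append)
    ftp.quit()
    return b"".join(chunks)

# On a cluster (sketch, assuming a SparkContext `sc` and an input file
# "filings.txt"):
#   urls = sc.textFile("filings.txt").map(extract_ftp_url).filter(bool)
#   data = urls.repartition(32).map(fetch)
# With 32 partitions there are at most 32 fetch tasks in flight at once
# (fewer if the cluster has fewer total cores).
```

The key design point is that fetches are independent, so they parallelize trivially with map(); the repartition step exists only to keep the number of concurrent connections polite toward the remote FTP server.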