How to process S3 data in a scalable manner using the Spark API (wholeTextFiles is VERY SLOW and NOT scalable)

2021-10-02 Thread Alchemist
Issue: We are using the wholeTextFiles() API to read files from S3, but this
API is extremely SLOW for the reasons described below. The question is how to
fix this issue.
Here is our analysis so far:
The issue is that we are using Spark's wholeTextFiles API to read S3 files.
wholeTextFiles works in two steps: first, the driver/master lists all the S3
files; second, the driver/master splits the list of files and distributes
those files to the worker nodes/executors for processing.
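For reference, a minimal sketch of the pattern in question (the bucket name,
prefix and SparkSession setup are illustrative assumptions, not from the
original report); both the listing and the splitting happen on the driver
before any executor does work:

    // Minimal sketch of the slow pattern; "my-bucket" and the prefix are
    // hypothetical. wholeTextFiles returns an RDD of (path, content) pairs,
    // but the driver must first enumerate every matching S3 object.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("whole-text-demo").getOrCreate()
    val files = spark.sparkContext.wholeTextFiles("s3a://my-bucket/reports/2021/10/*")
    println(files.count())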

STEP 1. List all the S3 files under the given paths (we pass this path when we
run every single gw/device/app step). The issue is that every single batch of
every single report begins by listing its files. The main problem is that
listing the files in an S3 bucket is single threaded here: the S3 API for
listing the keys in a bucket returns keys in chunks of at most 1,000 per call,
and the single-threaded client walks those chunks sequentially. For a million
files we are therefore looking at roughly 1,000 sequential S3 API calls.
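One workaround, sketched below under stated assumptions (AWS SDK for Java v1
on the classpath; the bucket and prefixes are hypothetical), is to drive the
listing yourself and parallelise it across known prefixes, so the
1,000-keys-per-call pagination runs concurrently per prefix instead of
serially over the whole bucket:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.ListObjectsV2Request
    import scala.collection.JavaConverters._

    // Page through one prefix; each listObjectsV2 call returns at most 1,000 keys.
    def listKeys(bucket: String, prefix: String): Seq[String] = {
      val s3   = AmazonS3ClientBuilder.defaultClient()
      val req  = new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix)
      val keys = scala.collection.mutable.ArrayBuffer[String]()
      var done = false
      while (!done) {
        val res = s3.listObjectsV2(req)
        keys ++= res.getObjectSummaries.asScala.map(_.getKey)
        if (res.isTruncated) req.setContinuationToken(res.getNextContinuationToken)
        else done = true
      }
      keys.toSeq
    }

    val bucket   = "my-bucket"                                       // hypothetical
    val prefixes = Seq("reports/2021/10/01/", "reports/2021/10/02/") // hypothetical
    val allKeys  = prefixes.par.flatMap(p => listKeys(bucket, p)).toList

(Note: .par is built in on Scala 2.12 but needs the scala-parallel-collections
module on 2.13. If the job reads many input directories, it may also be worth
checking whether Hadoop's mapreduce.input.fileinputformat.list-status.num-threads
setting parallelises the listing step in your Hadoop/S3A version.)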

STEP 2. The number of splits depends on the number of input partitions; the
driver then distributes the load to the worker nodes for processing.
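With a key list in hand (continuing the sketch above, and assuming a
SparkSession named spark), the driver no longer needs to list or split
anything: hand the keys to the executors and fetch each object there.
wholeTextFiles(path, minPartitions) also accepts a minPartitions argument to
influence the number of splits, but it does not avoid the driver-side listing.

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.io.Source

    val numSlices = 512                                 // tune to your cluster
    val texts = spark.sparkContext
      .parallelize(allKeys, numSlices)
      .mapPartitions { keys =>
        val s3 = AmazonS3ClientBuilder.defaultClient()  // one client per partition
        keys.map { key =>
          val obj  = s3.getObject(bucket, key)
          val body = Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
          obj.close()
          (s"s3a://$bucket/$key", body)                 // same shape as wholeTextFiles
        }
      }

This is a sketch, not production code: there are no retries, and like
wholeTextFiles it materialises each file as a single in-memory string.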



Re: Choice of IDE for Spark

2021-10-02 Thread Christian Pfarr
We use Jupyter on Hadoop https://jupyterhub-on-hadoop.readthedocs.io/en/latest/
for developing Spark jobs directly inside the cluster they should run on.

With that you have direct access to YARN and HDFS (fully secured) without any
migration steps.

You can control the size of your Jupyter YARN container and, of course, your
Spark session.

Regards,

Christian

---- Original Message ----
On 2 Oct 2021, at 01:21, Holden Karau wrote:

> Personally I like Jupyter notebooks for my interactive work and then once
> I’ve done my exploration I switch back to emacs with either scala-metals or
> Python mode.
>
> I think the main takeaway is: do what feels best for you, there is no one
> true way to develop in Spark.
>
> On Fri, Oct 1, 2021 at 1:28 AM Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
>
>
> > Thanks guys for your comments.
> >
> > I agree with you Florian that opening a terminal, say in VSC, allows you
> > to run a shell script (an sh file) to submit your Spark code; however,
> > this really makes sense if your IDE is running on a Linux host submitting
> > a job to a Kubernetes or YARN cluster.
> >
> > For Python, I will go with PyCharm, which is specific to the Python world.
> > With Spark, I have used IntelliJ with the Spark plug-in on a Mac for
> > development work. Then I created a JAR file, gzipped the whole project and
> > scped it to an IBM sandbox, untarred it and ran it with a pre-prepared
> > shell script with environment plugins for dev, test, staging etc.
> >
> > An IDE is also useful for looking at csv or tsv type files, or for
> > converting json from one form to another. For json validation, especially
> > if the file is too large, you may be restricted from loading the file into
> > a web json validator because of the risk of proprietary data being
> > exposed. There is a tool called jq (a lightweight and flexible
> > command-line JSON processor) that comes in pretty handy to validate json.
> > Download and install it on the OS and run it as
> >
> > zcat .tgz | jq
> >
> > That will validate the whole tarred and gzipped json file. Otherwise, most
> > of these IDE tools come with add-on plugins for various needs. My
> > preference would be to use the best available IDE for the job. VSC I would
> > consider a general-purpose tool. If all fails, one can always use OS
> > stuff like vi, vim, sed, awk etc 🤔
> >
> > Cheers
> >
> > view my Linkedin profile
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising from
> > such loss, damage or destruction.
> >
> > On Fri, 1 Oct 2021 at 06:55, Florian CASTELAIN
> > <florian.castel...@redlab.io> wrote:
> >
> >
> > > Hello.
> > >
> > > Any "evolved" code editor allows you to create tasks (or builds, or
> > > whatever they are called in the IDE you chose). If you do not find
> > > anything that packages by default all you need, you could just create
> > > your own tasks.
> > >
> > > *For yarn, one needs to open a terminal and submit from there.*
> > >
> > > You can create task(s) that launch your yarn commands.
> > >
> > > *With VSC, you get stuff for working with json files but I am not sure
> > > about a plugin for Python*
> > >
> > > In your json task configuration, you can launch whatever you want:
> > > python, shell. I bet you could launch your favorite video game (just
> > > make a task called "let's have a break" 😉)
> > >
> > > Just to say, if you want everything exactly the way you want, I do not
> > > think you will find an IDE that does it. You will have to customize it
> > > (correct me if wrong, of course).
> > >
> > > Have a good day.
> > >
> > > Florian CASTELAIN
> > > Ingénieur Logiciel
> > > 72 Rue de la République, 76140 Le Petit-Quevilly

Re: Choice of IDE for Spark

2021-10-02 Thread Паша
Disclaimer: I'm a developer advocate for data engineering at JetBrains, so I'm
definitely biased.

And if someone likes Zeppelin — there is an awesome integration of Zeppelin
into IDEA via Big Data Tools plugin — one can perform any explorations they
want/need and then extract all their work into real code with a simple
refactoring → extract Spark Job.

--
Regards,
Pasha

On Sat, 2 Oct 2021 at 04:03, Holden Karau wrote:

> Personally I like Jupyter notebooks for my interactive work and then once
> I’ve done my exploration I switch back to emacs with either scala-metals or
> Python mode.
>
> I think the main takeaway is: do what feels best for you, there is no one
> true way to develop in Spark.
>
> On Fri, Oct 1, 2021 at 1:28 AM Mich Talebzadeh 
> wrote:
>
>> Thanks guys for your comments.
>>
>> I agree with you Florian that opening a terminal, say in VSC, allows you to
>> run a shell script (an sh file) to submit your Spark code; however, this
>> really makes sense if your IDE is running on a Linux host submitting a job
>> to a Kubernetes or YARN cluster.
>>
>> For Python, I will go with PyCharm, which is specific to the Python world.
>> With Spark, I have used IntelliJ with the Spark plug-in on a Mac for
>> development work. Then I created a JAR file, gzipped the whole project and
>> scped it to an IBM sandbox, untarred it and ran it with a pre-prepared
>> shell script with environment plugins for dev, test, staging etc.
>>
>> An IDE is also useful for looking at csv or tsv type files, or for
>> converting json from one form to another. For json validation, especially
>> if the file is too large, you may be restricted from loading the file into
>> a web json validator because of the risk of proprietary data being exposed.
>> There is a tool called jq (a lightweight and flexible command-line JSON
>> processor) that comes in pretty handy to validate json. Download and
>> install it on the OS and run it as
>>
>> zcat .tgz | jq
>>
>> That will validate the whole tarred and gzipped json file. Otherwise, most
>> of these IDE tools come with add-on plugins for various needs. My
>> preference would be to use the best available IDE for the job. VSC I would
>> consider a general-purpose tool. If all fails, one can always use OS stuff
>> like vi, vim, sed, awk etc 🤔
>>
>>
>> Cheers
>>
>>
>> view my Linkedin profile
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 1 Oct 2021 at 06:55, Florian CASTELAIN <
>> florian.castel...@redlab.io> wrote:
>>
>>> Hello.
>>>
>>> Any "evolved" code editor allows you to create tasks (or builds, or
>>> whatever they are called in the IDE you chose). If you do not find anything
>>> that packages by default all you need, you could just create your own tasks.
>>>
>>>
>>> *For yarn, one needs to open a terminal and submit from there.*
>>>
>>> You can create task(s) that launch your yarn commands.
>>>
>>>
>>> *With VSC, you get stuff for working with json files but I am not sure
>>> about a plugin for Python*
>>>
>>> In your json task configuration, you can launch whatever you want:
>>> python, shell. I bet you could launch your favorite video game (just make a
>>> task called "let's have a break" 😉)
>>>
>>> Just to say, if you want everything exactly the way you want, I do not
>>> think you will find an IDE that does it. You will have to customize it.
>>> (correct me if wrong, of course).
>>>
>>> Have a good day.
>>>
>>>
>>> *Florian CASTELAIN *
>>> *Ingénieur Logiciel*
>>>
>>> 72 Rue de la République, 76140 Le Petit-Quevilly
>>> 
>>> m: +33 616 530 226
>>> e: florian.castel...@redlab.io w: www.redlab.io
>>>
>>> --
>>> *From:* Jeff Zhang 
>>> *Sent:* Thursday, 30 September 2021 13:57
>>> *To:* Mich Talebzadeh 
>>> *Cc:* user @spark 
>>> *Subject:* Re: Choice of IDE for Spark
>>>
>>> IIRC, you want an IDE for pyspark on yarn?
>>>
>>> Mich Talebzadeh wrote on Thursday, 30 September 2021 at 7:00 PM:
>>>
>>> Hi,
>>>
>>> This may look like a redundant question, but it comes about because of
>>> the advent of cloud workstations like Amazon WorkSpaces and others.
>>>
>>> With IntelliJ you are OK with Spark & Scala. With PyCharm you are fine
>>> with PySpark and the virtual environment. Mind you, as far as I know,
>>> PyCharm only executes spark-submit in local mode. For yarn, one needs to
>>> open a terminal and submit from there.
>>>
>>> However, in Amazon workstation, you get Visual