Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Shuporno Choudhury
.invokeMethod(AbstractCommand.java:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.ClassNotFoundException:
>> com.amazonaws.auth.AWSCredentialsProv

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Shuporno Choudhury
On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury < shuporno.choudh...@gmail.com> wrote:
> Hi,
> Your connection config uses 's3n' but your read command uses 's3a'.
> The config keys for s3a are:
> spark.hadoop.fs.s3a.access.key
> spark.hadoop.fs.s3a.secret.key
>
> I feel th
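A minimal sketch of the point made in this message, assuming PySpark is installed and the s3a connector (the hadoop-aws module and a matching AWS SDK) is on the classpath. The helper names, the bucket, and the credential placeholders are hypothetical; only the two `spark.hadoop.fs.s3a.*` keys come from the message itself.

```python
from urllib.parse import urlparse


def s3a_conf(access_key, secret_key):
    """Build the two s3a credential settings named in the message."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def uses_scheme(path, scheme="s3a"):
    """Return True when the path's URI scheme matches the configured connector."""
    return urlparse(path).scheme == scheme


def read_csv_from_s3(path, access_key, secret_key, app_name="s3a-example"):
    """Read a CSV over s3a, refusing paths whose scheme does not match."""
    if not uses_scheme(path):
        raise ValueError("path scheme does not match the s3a settings: " + path)

    # Imported here so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf(access_key, secret_key).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.csv(path, header=True)
```

Calling `read_csv_from_s3("s3a://example-bucket/data.csv", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")` would then use the same scheme in both the settings and the read path. A path starting with `s3n://` is rejected before Spark is started.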

Re: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Shuporno Choudhury
--Thanks, Shuporno Choudhury

Re: Clearing usercache on EMR [pyspark]

2018-08-03 Thread Shuporno Choudhury
Can anyone please help me with this issue?
On Fri, 3 Aug 2018 at 11:27, Shuporno Choudhury < shuporno.choudh...@gmail.com> wrote:
> Can anyone please help me with this issue?
>
> On Wed, 1 Aug 2018 at 12:50, Shuporno Choudhury [via Apache Spark User List] wrote:
> >

Clearing usercache on EMR [pyspark]

2018-08-01 Thread Shuporno Choudhury
to avoid creating these directories or automatically clearing the usercache/filecache after a job/periodically? -- --Thanks, Shuporno Choudhury

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
that the variables are not daisy-chained/inter-related as
> that too will not make it easy.
>
> *From: *Jay <[hidden email]>
> *Date: *Monday, June 4, 2018 at 9:41 PM
> *To: *Shuporno Choudhury <[hidden emai

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
big monolithic
> applications.
>
> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
>
> Hi,
>
> Thanks for the input.
> I was trying to get the functionality first, hence I was using local mo

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
your code and write for each process an
> independent python program that is submitted via Spark?
>
> Not sure though if Spark local makes sense. If you don’t have a cluster
> then a normal python program can be much better.
>
> On 4. Jun 2018, at 21:37, Shuporno Choudhury <

[PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
that is doing all the processing. If it is not possible to clear out memory, what can be a better approach for this problem? Can someone please help me with this and tell me if I am going wrong anywhere? --Thanks, Shuporno Choudhury

[pyspark] Read multiple files parallely into a single dataframe

2018-05-04 Thread Shuporno Choudhury
Hi, I want to read multiple files in parallel into 1 dataframe. But the files have random names and do not conform to any pattern (so I can't use a wildcard). Also, the files can be in different directories. If I provide the file names in a list to the dataframe reader, it reads them sequentially.

Getting Corrupt Records while loading data into dataframe from csv file

2018-04-23 Thread Shuporno Choudhury
, it seems the options that I have are the 3 modes (PERMISSIVE, DROPMALFORMED and FAILFAST), none of which seem to fulfill the objective. -- --Thanks, Shuporno Choudhury

Multiple columns using 'isin' command in pyspark

2018-03-29 Thread Shuporno Choudhury
ould be really appreciated. -- Thanks, Shuporno Choudhury