Re: SPARK Issue in Standalone cluster

2017-08-22 Thread Sea aj
Hi everyone, I have a huge dataframe with 1 billion rows, and each row is a nested list. I want to train some ML models on this df, but due to its huge size I get an out-of-memory error on one of my nodes when I run the fit function. Currently, my configuration is: 144 cores, 16 cores for
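A first step for this kind of OOM is usually to check how many partitions the data is split into before calling fit. As a rough, illustrative calculation (all figures below are hypothetical assumptions, not numbers from the thread):

```python
# Rough, illustrative sizing arithmetic for the OOM scenario above.
# All figures are hypothetical assumptions, not taken from the thread.
def partitions_needed(total_rows, bytes_per_row, target_partition_bytes):
    """Estimate how many partitions keep each partition under a target size."""
    total_bytes = total_rows * bytes_per_row
    # Ceiling division, so no partition exceeds the target.
    return -(-total_bytes // target_partition_bytes)

# 1 billion rows at ~200 bytes each, aiming for ~128 MiB partitions:
n = partitions_needed(1_000_000_000, 200, 128 * 1024 * 1024)
print(n)  # → 1491
```

With an estimate like this one could try `df.repartition(n)` before fitting, so that no single task has to materialize an oversized partition on one node.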

Re: SPARK Issue in Standalone cluster

2017-08-06 Thread Gourav Sengupta
Hi Marco, thanks a ton, I will surely use those alternatives. Regards, Gourav Sengupta On Sun, Aug 6, 2017 at 3:45 PM, Marco Mistroni wrote: > Sengupta > further to this, if you try the following notebook in databricks cloud, > it will read a .csv file, write to a

Re: SPARK Issue in Standalone cluster

2017-08-06 Thread Marco Mistroni
Sengupta, further to this: if you try the following notebook in Databricks cloud, it will read a .csv file, write to a parquet file, and read it again (just to count the number of rows stored). Please note that the path to the csv file might differ for you. So, what you will need to do is 1 -

Re: SPARK Issue in Standalone cluster

2017-08-05 Thread Marco Mistroni
Uh, believe me, there are lots of ppl on this list who will send u code snippets if u ask...  Yes, that is what Steve pointed out, suggesting also that for that simple exercise you should perform all operations on a Spark standalone instead (or alt. use an NFS on the cluster). I'd agree with his

Re: SPARK Issue in Standalone cluster

2017-08-05 Thread Gourav Sengupta
Hi Marco, For the first time in several years, FOR THE VERY FIRST TIME, I am seeing someone actually executing code and providing a response. It feels wonderful that at least someone considered to respond back by executing code and did not just filter out each and every technical detail to brood

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Gourav Sengupta
Hi Marco, I am sincerely obliged for your kind time and response. Can you please try the solution that you have so kindly suggested? It will be a lot of help if you could kindly execute the code that I have given. I don't think that anyone has yet. There are lots of fine responses to my question

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Jean Georges Perrin
I use CIFS and it works reasonably well and easily cross-platform, well documented... > On Aug 4, 2017, at 6:50 AM, Steve Loughran wrote: > > >> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: >> >> Hello >> my 2 cents here, hope it helps >> If

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Steve Loughran
> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: > > Hello > my 2 cents here, hope it helps > If you want to just to play around with Spark, i'd leave Hadoop out, it's an > unnecessary dependency that you dont need for just running a python script > Instead do the

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Marco Mistroni
Hello, my 2 cents here, hope it helps. If you want to just play around with Spark, I'd leave Hadoop out; it's an unnecessary dependency that you don't need for just running a python script. Instead do the following: - go to the root of your master / slave node, create a directory /root/pyscripts -
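Marco's setup steps can be sketched as shell commands. Host names, the script name, and the master URL below are placeholders (his original paths may differ); the key idea is that without HDFS, every node needs the same files at the same local path.

```shell
# Sketch of the setup outlined above; host names, script name, and
# master URL are placeholders, not values from the original email.

# 1 - create the scripts directory on the master and on every slave node
mkdir -p /root/pyscripts

# 2 - copy your script and any sample data to the *same* path on each node
#     (with no HDFS, every worker must see identical local paths)
scp /root/pyscripts/my_script.py root@slave1:/root/pyscripts/

# 3 - submit against the standalone master
spark-submit --master spark://master-host:7077 /root/pyscripts/my_script.py
```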

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Gourav Sengupta
Hi Steve, I love you mate, thanks a ton once again for ACTUALLY RESPONDING. I am now going through the documentation ( https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer_architecture.md) and it

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Steve Loughran
On 2 Aug 2017, at 20:05, Gourav Sengupta > wrote: Hi Steve, I have written a sincere note of apology to everyone in a separate email. I sincerely request your kind forgiveness beforehand if anything does sound impolite in my

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Gourav Sengupta
an on spark 1.5 may have been because the executor ran on >> the driver itself. There is not much use to a set up where you don’t have >> some kind of distributed file system, so I would encourage you to use hdfs, >> or a mounted file system shared by all nodes. >> >>

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Frank Austin Nothaft
by all nodes. > > > > Regards, > > Mahesh > > > > > > From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com > <mailto:gourav.sengu...@gmail.com>] > Sent: Monday, July 31, 2017 9:54 PM > To: Riccardo Ferrari > Cc: user > Subject:

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Steve Loughran
o:gourav.sengu...@gmail.com>] Sent: Monday, July 31, 2017 9:54 PM To: Riccardo Ferrari Cc: user Subject: Re: SPARK Issue in Standalone cluster Hi Riccardo, I am grateful for your kind response. Also I am sure that your answer is completely wrong and errorneous. SPARK must be having a method

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Gourav Sengupta
so I would encourage you to use hdfs, > or a mounted file system shared by all nodes. > > > > Regards, > > Mahesh > > > > > > *From:* Gourav Sengupta [mailto:gourav.sengu...@gmail.com] > *Sent:* Monday, July 31, 2017 9:54 PM > *To:* Riccardo Ferrari >

RE: SPARK Issue in Standalone cluster

2017-07-31 Thread Mahesh Sawaiker
, or a mounted file system shared by all nodes. Regards, Mahesh From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com] Sent: Monday, July 31, 2017 9:54 PM To: Riccardo Ferrari Cc: user Subject: Re: SPARK Issue in Standalone cluster Hi Riccardo, I am grateful for your kind response. Also I am

Re: SPARK Issue in Standalone cluster

2017-07-31 Thread Gourav Sengupta
Hi Riccardo, I am grateful for your kind response. Also I am sure that your answer is completely wrong and erroneous. SPARK must be having a method so that different executors do not pick up the same files to process. You also did not answer the question why the processing was successful in
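Gourav is right that Spark does not let two executors process the same input split: the driver assigns each split to exactly one task. The subtle point, which the replies keep making, is that assignment is separate from access: every executor must still be able to open the file backing its split. A toy illustration (this is not Spark's actual scheduler, just a round-robin sketch of the idea):

```python
# Toy illustration (NOT Spark's actual scheduler): a driver assigns each
# input split to exactly one executor, so no two executors process the
# same split -- but every executor must still be able to *open* the file
# behind its split, which is why a shared filesystem (HDFS/NFS) matters.
def assign_splits(splits, executors):
    """Round-robin each split to one executor; no split is duplicated."""
    assignment = {e: [] for e in executors}
    for i, split in enumerate(splits):
        assignment[executors[i % len(executors)]].append(split)
    return assignment

splits = [f"part-{i:05d}" for i in range(6)]
plan = assign_splits(splits, ["executor-1", "executor-2"])
print(plan["executor-1"])  # → ['part-00000', 'part-00002', 'part-00004']
```

No split appears under two executors, yet if `part-00002` only exists on the driver's local disk, executor-1 still cannot read it.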

Re: SPARK Issue in Standalone cluster

2017-07-31 Thread Riccardo Ferrari
Hi Gourav, The issue here is the location where you're trying to write/read from: /Users/gouravsengupta/Development/spark/sparkdata/test1/p... When dealing with clusters, all the paths and resources should be available to all executors (and the driver), and that is the reason why you generally use HDFS,

SPARK Issue in Standalone cluster

2017-07-30 Thread Gourav Sengupta
Hi, I am working by creating a native SPARK standalone cluster (https://spark.apache.org/docs/2.2.0/spark-standalone.html); therefore I do not have HDFS. EXERCISE: It's the most fundamental and simple exercise. Create a sample SPARK dataframe, then write it to a location, and then read it