Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Dhrubajyoti Hati
FYI, we are using Spark 2.2.0. Should the change be present in this Spark version? I wanted to check before opening a JIRA ticket. *Regards, Dhrubajyoti Hati.* On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > This looks like a bug that path filter doesn't work for hive table &

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Dhrubajyoti Hati
Just wondering if anyone could help me out on this. Thank you! *Regards, Dhrubajyoti Hati.* On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati wrote: > Hi, > > Is there any way to discard files starting with dot(.) or ending with .tmp > in the hive partition while reading fro

Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Dhrubajyoti Hati
Hi, Is there any way to discard files starting with a dot (.) or ending with .tmp in a Hive partition while reading from a Hive table using the spark.read.table method? I tried using PathFilters but they didn't work. I am using spark-submit and passing my Python file (PySpark) containing the source code.
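
The filtering rule asked about above can be sketched in plain Python. This is only an illustration of the predicate, not a confirmed fix for spark.read.table: the helper names is_hidden_or_tmp and visible_files and the example paths are hypothetical, and the thread does not state the table's storage format.

```python
import posixpath


def is_hidden_or_tmp(path):
    """Return True for files the question wants skipped:
    names starting with a dot or ending with .tmp."""
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(".") or name.endswith(".tmp")


def visible_files(paths):
    """Keep only the paths that are neither hidden nor temporary."""
    return [p for p in paths if not is_hidden_or_tmp(p)]


# Hypothetical listing of one partition directory.
listing = [
    "hdfs://nn/warehouse/t/dt=2020-04-22/part-00000",
    "hdfs://nn/warehouse/t/dt=2020-04-22/.part-00001.crc",
    "hdfs://nn/warehouse/t/dt=2020-04-22/part-00002.tmp",
]
print(visible_files(listing))
```

If the table happens to be stored as Parquet (an assumption), the surviving paths could be passed to a file-based reader such as spark.read.parquet(*paths). That reads the files directly rather than through the metastore, so partition columns would have to be handled separately.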

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
3, 2019 at 8:17 PM, Dhrubajyoti Hati > wrote: > >> I was wondering if anyone could help with this question. >> >> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, >> wrote: >> >>> Hi, >>> >>> I have a question regarding passing a dictionary f

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
I was wondering if anyone could help with this question. On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, wrote: > Hi, > > I have a question regarding passing a dictionary from driver to executors > in Spark on YARN. This dictionary is needed in a UDF. I am using PySpark. > &g

Collections passed from driver to executors

2019-09-19 Thread Dhrubajyoti Hati
Hi, I have a question regarding passing a dictionary from the driver to the executors in Spark on YARN. This dictionary is needed in a UDF. I am using PySpark. As I understand it, this can be done in two ways: 1. Broadcast the variable and then use it in the UDFs 2. Pass the dictionary in the UDF itself
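
The two options listed above might look roughly like the sketch below. The names make_lookup, build_udfs and codes are hypothetical, and the StringType return type is an assumption about the dictionary's values. PySpark is imported only inside build_udfs, so the lookup logic can be checked on the driver without a cluster.

```python
def make_lookup(get_mapping):
    """Build a plain Python function that looks a key up in whatever
    mapping get_mapping() returns at call time."""
    def lookup(key):
        return get_mapping().get(key)
    return lookup


def build_udfs(spark, mapping):
    """Wrap the same lookup as two PySpark UDFs (requires pyspark)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Option 1: broadcast once, read it through .value inside the UDF.
    bc = spark.sparkContext.broadcast(mapping)
    via_broadcast = udf(make_lookup(lambda: bc.value), StringType())

    # Option 2: capture the dict in the closure; it is pickled with the UDF.
    via_closure = udf(make_lookup(lambda: mapping), StringType())
    return via_broadcast, via_closure


# Driver-side check of the lookup logic, no cluster needed.
codes = {"IN": "India", "DE": "Germany"}  # hypothetical dictionary
lookup = make_lookup(lambda: codes)
print(lookup("IN"), lookup("XX"))
```

Both UDFs return the same values for the same input. The difference lies in how the dictionary reaches the executors, which tends to matter more as the dictionary grows.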

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Also, the performance remains identical when running the same script from the Jupyter terminal instead of a normal terminal. In the script the spark context is created by the spark = SparkSession \ .builder \ .. .. getOrCreate() command. On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote: >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
e you creating the Spark Session in jupyter ? > > > On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati > wrote: > >> But would it be the case for multiple tasks running on the same worker >> and also both the tasks are running in client mode, so the one true is true >&g

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
eight > minutes. > > On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati > wrote: > >> Hi, >> >> I just ran the same script in a shell in jupyter notebook and find the >> performance to be similar. So I can confirm this is because the libraries >> used jupyter

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
. *Regards, Dhrubajyoti Hati.* On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati wrote: > Just checked from where the script is submitted, i.e. wrt the driver, the > Python envs are different. The Jupyter one is running within a virtual > environment which is Python
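
A quick way to compare the two environments mentioned above is to print the interpreter details from both the notebook and the spark-submit script. This is a minimal sketch using only the standard library; the function name python_env is hypothetical.

```python
import sys
import platform


def python_env():
    """Describe the interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        "prefix": sys.prefix,
    }


print(python_env())
```

Run on the driver, this shows which Python each launch method picks up. The executors' interpreter is controlled separately through the PYSPARK_PYTHON environment variable, so calling the same function inside a task would be needed to inspect that side.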

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
but in any case: are they >> both running against the same spark cluster with the same configuration >> parameters especially executor memory and number of workers? >> >> On Tue, 10 Sept 2019 at 20:05, Dhrubajyoti Hati < >> dhruba.w...@gmail.com> wrote

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
> > On Tue, 10 Sept 2019 at 20:05, Dhrubajyoti Hati < > dhruba.w...@gmail.com> wrote: > >> No, I checked for that, hence I wrote "brand new" Jupyter notebook. Also >> the times taken by the two are 30 mins and ~3 hrs, as I am reading a 500 gigs >> co

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
sks for each. > > On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati > wrote: > >> Hi, >> >> I am facing a weird behaviour while running a python script. Here is what >> the code looks like mostly: >> >> def fn1(ip): >>some code.

script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Hi, I am facing a weird behaviour while running a Python script. Here is what the code mostly looks like:

    def fn1(ip):
        some code...
        ...

    def fn2(row):
        ...
        some operations
        ...
        return row1

    udf_fn1 = udf(fn1)
    cdf = spark.read.table("")  # hive table is of size > 500 Gigs

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
ly directly translate > to heap usage. Here you just need a bit more memory. > > On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati > wrote: > > > > Hi Sean, > > > > Yeah I checked the heap, it's almost full. I checked the GC logs in the > executors where I f

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
memory'. I don't see that > you've checked heap usage - is it nearly full? > The answer isn't tuning but more heap. > (Sometimes with really big heaps the problem is big pauses, but that's > not the case here.) > > On Mon, Jul 29, 2019 at 1:26 AM Dhrubajy

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
sually > also requires more memory for the executor, but fewer executors. Similarly > the executor instances might be too many and they may not have enough heap. > You can also increase the memory of the executor. > > On 29.07.2019 at 08:22, Dhrubajyoti Hati wrote: > > Hi, >

Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-28 Thread Dhrubajyoti Hati
Hi, We were running Logistic Regression in Spark 2.2.X and then tried to see how it performs in Spark 2.3.X. Now we are facing an issue while running a Logistic Regression model in Spark 2.3.X on top of YARN (GCP Dataproc). In the TreeAggregate method it takes a huge time due to very High GC Acti

Timeline for stable release for Spark Structured Streaming

2017-07-10 Thread Dhrubajyoti Hati
egards,* *Dhrubajyoti Hati* *LinkedIn <https://www.linkedin.com/in/dhrubajyoti-hati-9213a92a/>*