FYI we are using Spark 2.2.0. Should the change be present in this Spark
version? Wanted to check before opening a JIRA ticket.
*Regards,*
*Dhrubajyoti Hati.*
On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote:
> This looks like a bug that path filter doesn't work for hive table ...
Just wondering if anyone could help me out on this.
Thank you!
*Regards,*
*Dhrubajyoti Hati.*
On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati wrote:
Hi,
Is there any way to discard files starting with dot (.) or ending with .tmp
in the Hive partition while reading from a Hive table using the
spark.read.table method?
I tried using PathFilters but they didn't work. I am using spark-submit and
passing my Python file (PySpark) containing the source code.
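In case it helps, a hedged sketch of one workaround (the paths, file format,
and table names below are placeholders, and it assumes the partitions hold
ordinary part-* data files): spark.read.table goes through the metastore and
offers no file-level filter hook in 2.2, but reading the files directly
allows a glob, and dot-prefixed files are already skipped as hidden by
Hadoop's input-path filtering:

# read the partition files directly; the part-* glob leaves out *.tmp files
# (basePath keeps dt as a partition column when pointing inside the table)
df = (spark.read
      .option("basePath", "hdfs:///warehouse/mydb.db/mytable")
      .parquet("hdfs:///warehouse/mydb.db/mytable/dt=2020-04-22/part-*"))

# on Spark 3.0+ a file-source read can do the same declaratively:
df3 = (spark.read
       .option("pathGlobFilter", "part-*")   # keep only final output files
       .parquet("hdfs:///warehouse/mydb.db/mytable"))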
I was wondering if anyone could help with this question.
On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati wrote:
Hi,
I have a question regarding passing a dictionary from the driver to the
executors in Spark on YARN. This dictionary is needed in a UDF. I am using
PySpark.
As I understand it, this can be passed in two ways (a sketch of both follows
below):
1. Broadcast the variable and then use it in the UDF
2. Pass the dictionary in the UDF itself
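A minimal runnable sketch of the two options (the dictionary contents,
column name, and data are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dict-udf-sketch").getOrCreate()
lookup = {"a": "alpha", "b": "beta"}   # placeholder dictionary

# option 1: broadcast once per worker; executors read bc.value locally
bc = spark.sparkContext.broadcast(lookup)
via_broadcast = udf(lambda k: bc.value.get(k), StringType())

# option 2: capture the plain dict in the closure; it is pickled and
# shipped inside every serialized task that uses the UDF
via_closure = udf(lambda k: lookup.get(k), StringType())

df = spark.createDataFrame([("a",), ("c",)], ["k"])
df.select(via_broadcast("k"), via_closure("k")).show()

For a dictionary this small the two behave about the same; broadcast mainly
pays off when the dictionary is large, because it is shipped to each worker
once rather than once per task.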
Also the performance remains identical when running the same script from a
Jupyter terminal instead of a normal terminal. In the script the Spark
session is created by:

spark = SparkSession \
    .builder \
    ...
    .getOrCreate()
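One hedged note on that pattern: getOrCreate() returns any session that is
already running, so in a Jupyter kernel the builder's settings may be
silently ignored if a session already exists. A quick way to compare the
effective configuration in both environments (the config key shown is just
an example):

# print the settings the running session actually uses
print(spark.sparkContext.master)
print(spark.sparkContext.getConf().get("spark.executor.memory", "not set"))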
On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote:
> ...are you creating the Spark Session in Jupyter?
>
>
> On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati wrote:
>
>> But would it be the case for multiple tasks running on the same worker,
>> and also both the tasks are running in client mode, so the one ...
> ...eight minutes.
>
> On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati wrote:
>
>> Hi,
>>
>> I just ran the same script in a shell in the Jupyter notebook and found
>> the performance to be similar. So I can confirm this is because of the
>> libraries used in Jupyter ...
*Regards,*
*Dhrubajyoti Hati.*
*Mob No: 9886428028/9652029028*
On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati wrote:
> Just checked from where the script is submitted, i.e. wrt the driver: the
> Python envs are different. The Jupyter one is running within a virtual
> environment which is Python ...
>> ...but in any case: are they both running against the same Spark cluster
>> with the same configuration parameters, especially executor memory and
>> number of workers?
>>
>> On Tue, 10 Sept 2019 at 20:05, Dhrubajyoti Hati <
>> dhruba.w...@gmail.com> wrote:
>> No, I checked for that, hence wrote "brand new" Jupyter notebook. Also
>> the times taken by the two are 30 mins and ~3 hrs, as I am reading a
>> 500-gig co...
> ...sks for each.
>
> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati wrote:
Hi,
I am facing a weird behaviour while running a python script. Here is what
the code looks like, mostly:

def fn1(ip):
    some code...
    ...

def fn2(row):
    ...
    some operations
    ...
    return row1

udf_fn1 = udf(fn1)
cdf = spark.read.table("")  # hive table is of size > 500 gigs
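For reference, a runnable sketch of the same shape (the table and column
names are hypothetical, and the body of fn1 is placeholder logic):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def fn1(ip):
    # placeholder transformation
    return str(ip).strip()

udf_fn1 = udf(fn1, StringType())

cdf = spark.read.table("mydb.mytable")            # hypothetical table
out = cdf.withColumn("ip_clean", udf_fn1("ip"))   # assumes an 'ip' column
out.show(5)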
> ...ly directly translate to heap usage. Here you just need a bit more
> memory.
>
> On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati wrote:
> >
> > Hi Sean,
> >
> > Yeah, I checked the heap; it's almost full. I checked the GC logs in the
> > executors, where I f...
> ...memory'. I don't see that you've checked heap usage - is it nearly
> full? The answer isn't tuning but more heap.
> (Sometimes with really big heaps the problem is big pauses, but that's
> not the case here.)
>
> On Mon, Jul 29, 2019 at 1:26 AM Dhrubajyoti Hati wrote:
> ...sually also requires more memory for the executor, but fewer executors.
> Similarly the executor instances might be too many and they may not have
> enough heap. You can also increase the memory of the executor.
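Following up on that advice, a minimal sketch of bumping executor memory
from PySpark at session creation (all values are placeholders to tune per
cluster; the same settings can also be passed as spark-submit flags):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lr-train")                       # placeholder app name
         .config("spark.executor.memory", "8g")     # more heap per executor
         .config("spark.executor.instances", "10")  # fewer, larger executors
         .getOrCreate())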
>
> On 29.07.2019 at 08:22, Dhrubajyoti Hati wrote:
Hi,
We were running Logistic Regression in Spark 2.2.x and then we tried to see
how it does in Spark 2.3.x. Now we are facing an issue while running a
Logistic Regression model in Spark 2.3.x on top of YARN (GCP Dataproc). In
the treeAggregate method it takes a huge time due to very high GC activity ...
*Regards,*
*Dhrubajyoti Hati*
*LinkedIn <https://www.linkedin.com/in/dhrubajyoti-hati-9213a92a/>*
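One hedged pointer for the treeAggregate slowdown: pyspark.ml's
LogisticRegression exposes an aggregationDepth parameter (the suggested
depth for treeAggregate, default 2), which is one knob worth experimenting
with alongside executor memory; whether it helps the GC activity seen here
is untested:

from pyspark.ml.classification import LogisticRegression

# aggregationDepth controls the fan-in of the treeAggregate used for
# gradient aggregation; the other values here are placeholders
lr = LogisticRegression(maxIter=100, regParam=0.01, aggregationDepth=4)
# model = lr.fit(train_df)   # train_df is a placeholder DataFrame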