> Is your patch part of a pull request from the master branch in github?
>
> Thanks,
> Prasad.
>
> From: Anders Arpteg
> Date: Thursday, October 22, 2015 at 10:37 AM
> To: Koert Kuipers
> Cc: user
> Subject: Re: Large number of conf broadcasts
>
> Yes, seems unnecessary.
Thanks, Koert.
Regards,
Prasad.
From: Koert Kuipers
Date: Thursday, December 17, 2015 at 1:06 PM
To: Prasad Ravilla
Cc: Anders Arpteg, user
Subject: Re: Large number of conf broadcasts
https://github.com/databricks/spark-avro/pull/95
Nice Koert, let's hope it gets merged soon.
/Anders
On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers wrote:
> https://github.com/databricks/spark-avro/pull/95
>
> On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote:
>
>> oh no wonder... it undoes the glob (i
oh no wonder... it undoes the glob (i was reading from /some/path/*),
creates a hadoopRdd for every path, and then creates a union of them using
UnionRDD.
thats not what i want... no need to do union. AvroInpuFormat already has
the ability to handle globs (or multiple paths comma separated) very
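As a rough sketch of the difference Koert describes (hypothetical snippet, not the actual spark-avro code; `sc` is a SparkContext and `paths` stands in for the expanded glob):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

object GlobVsUnion {
  // What the reader was doing: expand the glob, build one HadoopRDD per
  // file, then union them -- every hadoopFile call broadcasts its own conf.
  def perPathUnion(sc: SparkContext, paths: Seq[String]) =
    sc.union(paths.map { p =>
      sc.hadoopFile(p,
        classOf[AvroInputFormat[GenericRecord]],
        classOf[AvroWrapper[GenericRecord]],
        classOf[NullWritable])
    })

  // AvroInputFormat accepts the glob (or comma-separated paths) directly,
  // so one HadoopRDD -- and one conf broadcast -- covers the whole dataset.
  def singleGlob(sc: SparkContext) =
    sc.hadoopFile("/some/path/*",
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]],
      classOf[NullWritable])
}
```

(Spark code like this needs a live SparkContext, so it is a sketch rather than something runnable standalone.)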
Yes, seems unnecessary. I actually tried patching the
com.databricks.spark.avro reader to only broadcast once per dataset,
instead of every single file/partition. It seems to work just fine, and
there are significantly fewer broadcasts and I'm not seeing out of memory
issues any more. Strange that
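The broadcast-once idea can be sketched as follows (illustrative only, not the actual patch; Spark itself carries a similar `SerializableConfiguration` utility, and the names here are assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical sketch: wrap the (non-serializable) Hadoop Configuration so
// it can be broadcast once per dataset instead of once per file/partition.
class SerializableConfiguration(@transient var value: Configuration)
    extends Serializable {
  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration implements Writable
  }
  private def readObject(in: java.io.ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Broadcast once when the dataset is created...
def broadcastConfOnce(sc: SparkContext,
                      conf: Configuration): Broadcast[SerializableConfiguration] =
  sc.broadcast(new SerializableConfiguration(conf))

// ...and reuse that single broadcast inside every partition's reader,
// rather than calling sc.broadcast again for each file.
```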
i am seeing the same thing. it's gone completely crazy creating broadcasts
for the last 15 mins or so. killing it...
On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote:
> Hi,
>
> Running spark 1.5.0 in yarn-client mode, and am curious why there are so
> many broadcasts being done when loading datasets with a large number of
> partitions/files. Have datasets with thousands of partitions, i.e. hdfs
> files in the avro folder, and sometimes loading hundreds of these large