Re: Large number of conf broadcasts

2015-12-18 Thread Anders Arpteg
Anders Arpteg, user > > Subject: Re: Large number of conf broadcasts > > https://github.com/databricks/spark-avro/pull/95

Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Kuipers Cc: user Subject: Re: Large number of conf broadcasts Yes, seems unnecessary. I actually tried patching the com.databricks.spark.avro reader to only broadcast once per dataset, instead of every single file/partition. It seems to work just fine, and there are significantly fewer

Re: Large number of conf broadcasts

2015-12-17 Thread Koert Kuipers
our patch part of a pull request from the master branch in github? > > Thanks, > Prasad. > > From: Anders Arpteg > Date: Thursday, October 22, 2015 at 10:37 AM > To: Koert Kuipers > Cc: user > Subject: Re: Large number of conf broadcasts > > Yes, seems unnecessary.

Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Thanks, Koert. Regards, Prasad. From: Koert Kuipers Date: Thursday, December 17, 2015 at 1:06 PM To: Prasad Ravilla Cc: Anders Arpteg, user Subject: Re: Large number of conf broadcasts https://github.com/databricks/spark-avro/pull/95

Re: Large number of conf broadcasts

2015-10-26 Thread Anders Arpteg
Nice Koert, let's hope it gets merged soon. /Anders On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers wrote: > https://github.com/databricks/spark-avro/pull/95 > > On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote: > >> Oh, no wonder... it undoes the glob (I

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
Oh, no wonder... it undoes the glob (I was reading from /some/path/*), creates a HadoopRDD for every path, and then creates a union of them using UnionRDD. That's not what I want... there is no need to do a union. AvroInputFormat already has the ability to handle globs (or multiple paths, comma separated) very
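
For illustration, a minimal sketch of reading the same glob through a single hadoopFile call, so one HadoopRDD (and one Hadoop conf) covers all matched files instead of one RDD per path unioned together. It assumes Spark 1.5 with avro-mapred on the classpath; the object name and the count action are illustrative only, and this is not the spark-avro patch itself:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroGlobRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-glob-read"))

    // AvroInputFormat (via FileInputFormat) accepts globs and comma-separated
    // paths directly, so a single hadoopFile call -- and a single HadoopRDD --
    // covers /some/path/* without building a UnionRDD of per-path RDDs.
    val records = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
      AvroInputFormat[GenericRecord]]("/some/path/*")

    println(records.count())
    sc.stop()
  }
}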

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95 On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote: > Oh, no wonder... it undoes the glob (I was reading from /some/path/*), > creates a HadoopRDD for every path, and then creates a union of them using > UnionRDD. > > That's

Re: Large number of conf broadcasts

2015-10-22 Thread Anders Arpteg
Yes, seems unnecessary. I actually tried patching the com.databricks.spark.avro reader to only broadcast once per dataset, instead of every single file/partition. It seems to work just fine, there are significantly fewer broadcasts, and I'm not seeing out-of-memory issues any more. Strange that
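
For illustration, a minimal sketch of the "broadcast once per dataset" idea described above, not the actual patch to com.databricks.spark.avro: the Hadoop conf is wrapped in a single broadcast (here via Spark 1.x's SerializableWritable) and dereferenced on the executors, instead of being broadcast for every file/partition. The object name, example paths, and the fs.defaultFS lookup are hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

object BroadcastConfOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-conf-once"))

    // One broadcast of the (non-serializable) Hadoop conf for the whole dataset...
    val confBroadcast = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))

    // ...reused from every partition, instead of a fresh broadcast per file.
    val files = sc.parallelize(Seq("/data/part-0.avro", "/data/part-1.avro"), 2)
    val tagged = files.mapPartitions { iter =>
      val conf: Configuration = confBroadcast.value.value
      iter.map(path => (path, conf.get("fs.defaultFS")))
    }
    tagged.collect().foreach(println)
    sc.stop()
  }
}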

Re: Large number of conf broadcasts

2015-10-22 Thread Koert Kuipers
I am seeing the same thing. It's gone completely crazy creating broadcasts for the last 15 mins or so. Killing it... On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote: > Hi, > > Running Spark 1.5.0 in yarn-client mode, and am curious why there are so > many broadcasts

Large number of conf broadcasts

2015-09-24 Thread Anders Arpteg
Hi, Running Spark 1.5.0 in yarn-client mode, and am curious why there are so many broadcasts being done when loading datasets with a large number of partitions/files. Have datasets with thousands of partitions, i.e. HDFS files in the Avro folder, and sometimes loading hundreds of these large
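
For illustration, a minimal sketch of the load pattern described above, assuming Spark 1.5 with the spark-avro package available; the application name and the HDFS path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LoadLargeAvroDataset {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-load"))
    val sqlContext = new SQLContext(sc)

    // A dataset directory holding thousands of Avro part files on HDFS;
    // in the spark-avro version discussed in this thread, each underlying
    // file ended up with its own broadcast of the Hadoop conf.
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///datasets/events/2015-09-24")

    println(df.count())
    sc.stop()
  }
}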