You will need to create a Hive parquet table that points to the data and
run "ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have
statistics on the size.

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra <jitesh...@gmail.com>
wrote:

> Hi Michael,
>
> Thanks for your response. I am running 1.2.1.
>
> Is there any workaround to achieve the same with 1.2.1?
>
> Thanks,
> Jitesh
>
> On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> In Spark 1.3 I would expect this to happen automatically when the parquet
>> table is small (< 10 MB by default, configurable with
>> spark.sql.autoBroadcastJoinThreshold).
>> If you are running 1.3 and not seeing this, can you show the code you are
>> using to create the table?
>>
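>> For example (a sketch for 1.3; the paths, table names, column name, and
>> the sample threshold are placeholders):
>>
>>     from pyspark import SparkContext
>>     from pyspark.sql import SQLContext
>>
>>     sc = SparkContext(appName="auto-broadcast-example")
>>     sqlContext = SQLContext(sc)
>>
>>     # Optionally raise the threshold (here to 50 MB); any table whose
>>     # estimated size is below it is broadcast automatically in joins.
>>     sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=52428800")
>>
>>     big = sqlContext.parquetFile("/path/to/big")
>>     small = sqlContext.parquetFile("/path/to/small")
>>
>>     # With size statistics available, the planner should pick a
>>     # broadcast join for the small side instead of a ShuffledHashJoin.
>>     joined = big.join(small, big.id == small.id)
>>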
>> On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:
>>
>>> How can we implement a BroadcastHashJoin in Spark with Python?
>>>
>>> My Spark SQL inner joins are taking a lot of time since Spark is
>>> performing a ShuffledHashJoin.
>>>
>>> The tables being joined are stored as parquet files.
>>>
>>> Please help.
>>>
>>> Thanks and regards,
>>> Jitesh
>>>
>>
>
