Re: HiveContext is Serialized?

Sean Owen Wed, 26 Oct 2016 01:07:26 -0700

It is the driver that has the info needed to schedule and manage
distributed jobs and that is by design.


This is narrowly about using the HiveContext or SparkContext directly. Of
course SQL operations are distributed.
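
A rough plain-Scala analogy (no Spark involved) may help show why a context
can be Serializable and still be unusable inside a distributed operation.
FakeContext and DriverState below are hypothetical stand-ins, not Spark
classes, and a Java serialization round trip stands in for shipping a closure
to an executor. This is a sketch of the idea, not a description of how Spark
is implemented:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for driver-only state (scheduler, connections, ...).
// It is deliberately NOT Serializable.
class DriverState {
  def run(query: String): String = s"ran: $query"
}

// Hypothetical stand-in for a context: the class is Serializable, but its
// driver-only state is @transient, so it is skipped during serialization.
class FakeContext extends Serializable {
  @transient val state: DriverState = new DriverState
  def sql(query: String): String = state.run(query)
}

object SerializableButUnusable {
  // Round trip through Java serialization, standing in for sending an
  // object from the driver to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val onDriver = new FakeContext
    println(onDriver.sql("SELECT 1"))      // prints "ran: SELECT 1"

    val onExecutor = roundTrip(onDriver)   // serialization itself succeeds
    println(onExecutor.state == null)      // prints "true"
    // onExecutor.sql("SELECT 1") would now throw a NullPointerException.
  }
}
```

The copy serializes without complaint, but the state it needs to do any work
never leaves the original side, which is why the context should only be
called from driver code.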

On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distributed
> operation..."
>
> Is this because of a design issue?
>
> Case in point: if I create a DF from an RDD and register it as a tempTable,
> does this imply that any SQL calls on that table are localised and not
> distributed among executors?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a local Scala
> collection (the Array returned by collect()), not on an RDD or DataFrame.
>
> You can't use the HiveContext or SparkContext in a distributed operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. If the contexts were not
> serializable, such a task would fail to serialize even though it doesn't
> use the context. You will find these objects serialize but then don't
> work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was wondering whether I can use hiveContext inside foreach, as below:
>
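> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
>     val conf = new SparkConf()
>     val sc = new SparkContext(conf)
>     val hiveContext = new HiveContext(sc)
>
>     // Read the list of data elements (one column name per line).
>     val dataElementsFile = args(0)
>     val deDF = hiveContext.read.text(dataElementsFile)
>       .toDF("DataElement").coalesce(1).distinct().cache()
>
>     // Runs on the driver: builds and executes one query per data element.
>     def calculate(de: Row): Unit = {
>       val dataElement = de.getAs[String]("DataElement").trim
>       val df1 = hiveContext.sql(
>         "SELECT cyc_dt, supplier_proc_i, '" + dataElement + "' as data_elm, " +
>           dataElement + " as data_elm_val FROM TEST_DB.TEST_TABLE1 ")
>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>     }
>
>     // collect() returns a local Array, so foreach here is a plain Scala loop.
>     deDF.collect().foreach(calculate)
>   }
> }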
>
>
> I looked at
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
> and I see that it extends SQLContext, which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it? Thanks for your time.
>
> Regards,
>
> Ajay
>
>
>
