Thanks Sean.
I believe you are referring to the statement below:
"You can't use the HiveContext or SparkContext in a distribution operation.
It has nothing to do with for loops.
The fact that they're serializable is misleading. It's there, I believe,
because these objects may be inadvertently
Yes, but the question here is why the context objects are marked
serializable when they are not meant to be sent somewhere as bytes. I tried
to answer that apparent inconsistency below.
On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh
wrote:
Hi,
Sorry for asking this rather naïve question.
My question is about the notion of serialisation in Spark and what can or cannot be serialised.
Does this generally refer to the concept of serialisation in the context of
data storage?
In this context, for example with reference to RDD operations, is it the
process of
It is the driver that has the info needed to schedule and manage
distributed jobs and that is by design.
This is narrowly about using the HiveContext or SparkContext directly. Of
course SQL operations are distributed.
On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh
wrote:
In your use case, your deDF need not be a DataFrame. You could use
sc.textFile().collect().
Even better, you can just read off a local file, as your file is very small,
unless you are planning to use YARN cluster mode.
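Mich's suggestion can be sketched without Spark at all: since the file is tiny (~66 rows), the driver can read it directly and loop over the values locally. This is a minimal sketch; the file layout (one attribute per line) and the object/method names are assumptions, and the Hive call is only indicated in a comment.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object LocalAttributes {
  // Read a small attribute list entirely on the driver -- no RDD needed.
  def readAttributes(path: String): List[String] = {
    val src = Source.fromFile(path)
    try src.getLines().map(_.trim).filter(_.nonEmpty).toList
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: one attribute per line, ~66 rows.
    val f = File.createTempFile("attributes", ".txt")
    val w = new PrintWriter(f); w.println("attr_a\nattr_b\nattr_c"); w.close()
    // Driver-side loop; each iteration could safely issue hiveContext.sql(...)
    readAttributes(f.getPath).foreach(attr => println(s"would query Hive for $attr"))
  }
}
```

Because the loop runs entirely on the driver, a driver-side HiveContext can be used inside it without any serialization concerns.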
On 26 Oct 2016 16:43, "Ajay Chander" wrote:
Hi Sean,
Your point:
"You can't use the HiveContext or SparkContext in a distribution
operation..."
Is this because of a design issue?
Case in point: if I created a DF from an RDD and registered it as a tempTable,
does this imply that any SQL calls on that table are localised and not
distributed among
Sean, thank you for making it clear. It was helpful.
Regards,
Ajay
On Wednesday, October 26, 2016, Sean Owen wrote:
Thanks for the response Sean. I have seen the NPE on similar issues very
consistently and assumed that could be the reason :) Thanks for clarifying.
regards
Sunita
On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote:
This usage is fine, because you are only using the HiveContext locally on
the driver. It's applied in a function that's used on a Scala collection.
You can't use the HiveContext or SparkContext in a distribution operation.
It has nothing to do with for loops.
The fact that they're serializable is misleading. It's there, I believe,
because these objects may be inadvertently
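Sean's point can be illustrated without Spark: an object can be technically Serializable while its real state is @transient, so a deserialized copy arrives "empty" and fails with an NPE on first use rather than at serialization time. The sketch below uses a made-up `FakeContext` as a stand-in; it is not Spark's actual implementation, only a plain-Scala analogy for the behaviour described in this thread.

```scala
import java.io._

// Made-up stand-in for a context object: it *is* Serializable, but its
// real state is @transient, so a deserialized copy arrives empty.
class FakeContext extends Serializable {
  @transient private val state: StringBuilder = new StringBuilder("driver-side state")
  def query(sql: String): String = s"${state.toString}: $sql" // NPE if state is null
}

object TransientDemo {
  // Serialize to bytes and back, like shipping a closure to an executor.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    new ObjectOutputStream(buf).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val ctx = new FakeContext
    println(ctx.query("select 1"))   // fine on the "driver"
    val shipped = roundTrip(ctx)     // serialization itself succeeds...
    try shipped.query("select 1")    // ...but use on the "executor" fails
    catch { case _: NullPointerException => println("NPE after deserialization") }
  }
}
```

This matches the symptom reported earlier in the thread: no serialization error up front, but an NPE once the deserialized object is actually used.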
Sunita, thanks for your time. In my scenario, based on each attribute from
deDF (1 column with just 66 rows), I have to query a Hive table and insert
into another table.
Thanks,
Ajay
On Wed, Oct 26, 2016 at 12:21 AM, Sunita Arvind
wrote:
Ajay,
AFAIK, these contexts generally cannot be accessed within loops. The SQL
query itself runs on distributed datasets, so it is a parallel execution.
Putting it inside a foreach would nest one distributed operation inside
another, so serialization would become hard. Not sure I could explain it
right.
If you can
Jeff, thanks for your response. I see the error below in the logs. Do you
think it has anything to do with hiveContext? Do I have to serialize it
before using it inside foreach?
16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener
threw an exception
java.lang.NullPointerException
In your sample code, you can use hiveContext in the foreach because it is a
Scala List foreach operation, which runs on the driver side. But you cannot
use hiveContext in RDD.foreach.
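Jeff's distinction can be seen without a cluster: a Scala List#foreach is an ordinary local loop in the current JVM, which is why a driver-side context is safe inside it. A small plain-Scala illustration (no Spark involved; the table names are made up):

```scala
object DriverSideLoop {
  def main(args: Array[String]): Unit = {
    val tables = List("t1", "t2", "t3")
    var visited = List.empty[String]
    // List#foreach runs right here, in the driver JVM: it can freely use
    // driver-only objects (a HiveContext, a mutable var, a local file).
    tables.foreach(t => visited = t :: visited)
    println(visited.reverse)  // List(t1, t2, t3)
    // rdd.foreach, by contrast, ships the closure to executors: a captured
    // `visited` would only be a serialized copy there, and a captured
    // HiveContext would be unusable.
  }
}
```

The side effect on `visited` is visible afterwards precisely because nothing left the driver; with `rdd.foreach` it would not be.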
Ajay Chander wrote on Wed, Oct 26, 2016 at 11:28 AM:
Hi Everyone,
I was thinking if I can use hiveContext inside foreach like below,
object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    val dataElementsFile = args(0)
    val deDF =