Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been trying to find out why I am getting strange behavior when
running a certain Spark job. The job will error out if I place an action
(a .show(1) call) either right after caching the DataFrame or right before
writing the DataFrame back to HDFS. There is a very similar Stack Overflow
post here: "Spark SQL SaveMode.Overwrite, getting
java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'".

Basically, the other post explains that when you read from the same HDFS
directory that you are writing to, and your SaveMode is "overwrite", you
will get a java.io.FileNotFoundException. But here I am finding that just
moving where the action occurs in the program can give very different
results: the program either completes or throws this exception. I was
wondering if anyone can explain why Spark is not being consistent here?

val myDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(schema)
  .load(myPath)

// If I cache it here (or persist it) and then do an action after the cache,
// it will occasionally not throw the error. This is when completely
// restarting the SparkSession, so there is no risk of another user
// interfering on the same JVM.
myDF.cache()
myDF.show(1)

// The calls below are just meant to show that we are doing other "Spark
// DataFrame transformations"; different transformations have all led to the
// weird behavior, so I'm not being specific about what exactly they are.
val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)

val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// Below is the same .show(1) action as before, only this action ALWAYS
// results in successful completion, while the .show(1) above sometimes
// results in FileNotFoundException and sometimes in successful completion.
// The only thing that changes among test runs is which one is executed:
// either fourthDF.show(1) or myDF.show(1) is left commented out.
fourthDF.show(1)
fourthDF.write
  .mode(writeMode)
  .option("header", "false")
  .option("delimiter", "\t")
  .csv(myPath)
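
For what it's worth, a common way to avoid the read-then-overwrite race
described in that Stack Overflow post is to never write to the directory
you are still reading from: write to a temporary path first and swap it in
afterwards. A minimal sketch, assuming the variables from the code above;
the scratch path and FileSystem handling are inventions for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}

val tmpPath = myPath + "_tmp"   // hypothetical scratch location

fourthDF.write
  .mode("overwrite")
  .option("header", "false")
  .option("delimiter", "\t")
  .csv(tmpPath)

// Only after the write has fully succeeded is the original data replaced,
// so the job never reads from a directory it is deleting.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(myPath), true)
fs.rename(new Path(tmpPath), new Path(myPath))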


caching DataFrames

2015-09-23 Thread Zhang, Jingyu
I have A and B DataFrames.
A has columns a11, a12, a21, a22
B has columns b11, b12, b21, b22

I persist them in cache:
1. A.cache()
2. B.cache()

Then, I persist subsets in cache later:

3. DataFrame A1 (a11, a12).cache()

4. DataFrame B1 (b11, b12).cache()

5. DataFrame AB1 (a11, a12, b11, b12).cache()

Can you please tell me what happens in caching cases 3, 4, and 5 after A
and B are cached? How much more memory do I need compared with caching 1
and 2 only?

Thanks

Jingyu
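
In concrete Spark/Scala terms, the setup being asked about might look like
the following sketch; the data values and the join key are assumptions for
illustration only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for A and B.
val A = Seq((1, "x", 2, "y"), (3, "z", 4, "w"))
  .toDF("a11", "a12", "a21", "a22").cache()              // 1
val B = Seq((1, "p", 2, "q"), (3, "r", 4, "s"))
  .toDF("b11", "b12", "b21", "b22").cache()              // 2

val A1  = A.select("a11", "a12").cache()                 // 3
val B1  = B.select("b11", "b12").cache()                 // 4
val AB1 = A1.join(B1, A1("a11") === B1("b11")).cache()   // 5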



Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two DataFrames do not share cache storage in Spark, so it is immaterial
how the two DataFrames are related to each other. Each of them is going to
consume memory based on the data it holds. So for your A1 and B1 you would
need extra memory roughly equivalent to half the memory of A/B.

You can check the storage that a DataFrame is consuming in the Spark UI's
Storage tab: http://host:4040/storage/
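
Beyond the UI, here is a hedged sketch of inspecting cache usage
programmatically (Dataset.storageLevel is available from Spark 2.1;
getRDDStorageInfo is a developer API whose output may vary by version):

A1.count()                  // an action, to actually materialize the cache
println(A1.storageLevel)    // non-NONE storage level once cached (Spark 2.1+)

// Per-cached-plan memory usage, similar to what the Storage tab shows.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, " +
    s"${info.diskSize} bytes on disk")
}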





Re: caching DataFrames

2015-09-23 Thread Zhang, Jingyu
Thanks Hemant,

I will generate a total report (dfA) with many columns from log data.
After the total report is done, I will generate many detail reports
(dfA1-dfAi) based on subsets of the total report (dfA). Those detail
reports use aggregate and window functions, according to different rules.
However, some information is lost after the aggregate or window functions.

In the end, a few of the detail reports can be generated directly from a
subset DataFrame, but many of the reports need to get some information
back from the total report. Thus, I am considering whether there is any
performance benefit if I cache both dfA and its subsets, and if so, how
much memory I should prepare for them.
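
As a hedged illustration of the pattern being described (all column and
variable names here are invented), a detail report built with a window
function over the cached total report, joining back to recover a column
that the window step dropped:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// dfA stands in for the cached "total report".
val w = Window.partitionBy(col("userId")).orderBy(col("eventTime").desc)

// Detail report: the latest event per user, losing the other columns.
val dfA1 = dfA
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .select("userId", "eventTime")

// Join back to the total report to recover a dropped column.
val detailReport = dfA1.join(
  dfA.select("userId", "eventTime", "pageUrl"),
  Seq("userId", "eventTime"))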



On 24 September 2015 at 14:56, Hemant Bhanawat wrote:

> hit send button too early...
>
> However, why would you want to cache a DataFrame that is a subset of an
> already cached DataFrame?
>
> If dfA is cached, and dfA1 is created by applying some transformation on
> dfA, actions on dfA1 will use the cache of dfA.
>
> val dfA1 = dfA.filter($"_1" > 50)
>
> // this will run on the cached data of dfA
> dfA1.count()
