Please care about and vote for the Chinese people under the cruel autocracy of the CCP. Many thanks!

2019-08-28 Thread ant_fighter
Hi all,
Sorry for disturbing you all. Though I don't think this is the proper place to 
do this, I need your help, your vote, your holy vote, for us Chinese, for 
conscience and justice, and for a better world.

In its more than 70 years of ruling China, the Chinese Communist Party has done 
many of the most horrible things humans can think of. These malicious and evil 
deeds include, but are not limited to: falsifying national history, suppressing 
freedom of speech and of the press, money laundering on the scale of trillions, 
live organ harvesting, sexual harassment and assault of underage females, and 
slaughtering innocent citizens under counter-revolutionary pretexts.

In light of the recent violent actions against Hong Kongers by the People's 
Liberation Army (PLA) disguised as the Hong Kong Police Force, we the people 
petition to officially recognize the Chinese Communist Party as a terrorist 
organization.

PLEASE SIGN UP and VOTE for us:
https://petitions.whitehouse.gov/petition/call-official-recognition-chinese-communist-party-terrorist-organization

Thanks again to all!

nameless, an ant fighter
2019.8.29

[python 2.4.3] correlation matrix

2019-08-28 Thread Rishi Shah
Hi All,

What is the best way to calculate a correlation matrix?
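
For reference, a minimal PySpark sketch of one way to do it (assuming Spark
2.4.x, a DataFrame of numeric columns, and pyspark.ml.stat.Correlation; the
data and column names below are placeholders for illustration only):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    spark = SparkSession.builder.appName("corr-matrix-sketch").getOrCreate()

    # Placeholder data; replace with your own numeric DataFrame.
    df = spark.createDataFrame(
        [(1.0, 2.0, 3.0), (2.0, 4.1, 5.9), (3.0, 6.2, 9.1)],
        ["a", "b", "c"],
    )

    # Pack the numeric columns into a single vector column, then compute the
    # Pearson correlation matrix over that column.
    vec = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)
    corr = Correlation.corr(vec, "features").head()[0]
    print(corr.toArray())

For just two columns, df.stat.corr("a", "b") also works.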

-- 
Regards,

Rishi Shah


Re: Structured Streaming Dataframe Size

2019-08-28 Thread Nick Dawes
Thank you, TD. A couple of follow-up questions, please.

1) "It only keeps around the minimal intermediate state data"

How do you define "minimal" here? Is there a configuration property to
control the retention time or size of the streaming DataFrame's state?

2) I'm not writing anything out to any database or S3. My requirement is to
get a real-time count over a 1-hour window, and I would like to read this
count from a BI tool. So can I register it as a temp view and access it from
the BI tool?

I tried something like this in my streaming application:

AggStreamingDF.createOrReplaceGlobalTempView("streaming_table")

Then, in the BI tool, I queried like this:

select * from streaming_table

Error:  Queries with streaming sources must be executed with
writeStream.start()

Any suggestions to make this work?
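
One possible direction, offered only as a sketch (not something from TD's
reply): a streaming DataFrame itself cannot be queried through a temp view,
but the aggregated result can be continuously materialized with the in-memory
sink and then queried as a table, provided the BI tool can reach the same
Spark session where the query runs (how to expose that session, e.g. via a
Thrift server, is a separate question). The memory sink collects results on
the driver, so it suits modest result sizes. The source, column, and table
names below are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("windowed-count-sketch").getOrCreate()

    # Placeholder source; replace with the real stream (Kafka, files, ...).
    events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
              .withColumnRenamed("timestamp", "eventTime"))

    agg = (events
           # Watermark matters if you later use append/update mode; in complete
           # mode it does not evict state.
           .withWatermark("eventTime", "1 hour")
           .groupBy(F.window("eventTime", "1 hour"))
           .count())

    query = (agg.writeStream
             .outputMode("complete")
             .format("memory")            # materializes results as an in-memory table
             .queryName("streaming_table")
             .start())

    # In the same Spark session the table is then queryable:
    # spark.sql("select * from streaming_table").show(truncate=False)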

Thank you very much for your help!


On Tue, Aug 27, 2019, 6:42 PM Tathagata Das 
wrote:

>
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
>
> *Note that Structured Streaming does not materialize the entire table*.
>> It reads the latest available data from the streaming data source,
>> processes it incrementally to update the result, and then discards the
>> source data. It only keeps around the minimal intermediate *state* data
>> as required to update the result (e.g. intermediate counts in the earlier
>> example).
>>
>
>
> On Tue, Aug 27, 2019 at 1:21 PM Nick Dawes  wrote:
>
>> I have a quick newbie question.
>>
>> Spark Structured Streaming creates an unbounded dataframe that keeps
>> appending rows to it.
>>
>> So what's the max size of data it can hold? What if the size becomes
>> bigger than the JVM? Will it spill to disk? I'm using S3 as storage. So
>> will it write temp data on S3 or on local file system of the cluster?
>>
>> Nick
>>
>


Re: Caching tables in spark

2019-08-28 Thread Tzahi File
I mean two separate Spark jobs.
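
Worth noting as background for the sketch below: two separate Spark
applications cannot share each other's cache()/persist() state, since caching
is scoped to a single application. A common pattern is to have one job
materialize the prepared data once (e.g. as Parquet, or in an external
storage/caching layer) and let both jobs read that copy. A rough sketch, with
the table name and path as placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prepare-shared-data").getOrCreate()

    # Placeholder names: read the ~1.5 TB raw table once and keep only what
    # both downstream jobs actually need.
    raw = spark.table("raw_data_table")
    prepared = raw.where("event_date = '2019-08-28'")

    # Materialize once; both jobs then read this much cheaper copy.
    prepared.write.mode("overwrite").parquet("s3://some-bucket/shared/prepared/")

    # In each of the two jobs:
    # shared = spark.read.parquet("s3://some-bucket/shared/prepared/")
    # shared.cache()   # optional, and scoped to that job only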



On Wed, Aug 28, 2019 at 2:25 PM Subash Prabakar 
wrote:

> By process, do you mean two separate Spark jobs? Or two stages within the
> same Spark code?
>
> Thanks
> Subash
>
> On Wed, 28 Aug 2019 at 19:06,  wrote:
>
>> Take a look at this article
>>
>>
>>
>>
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html
>>
>>
>>
>> *From:* Tzahi File 
>> *Sent:* Wednesday, August 28, 2019 5:18 AM
>> *To:* user 
>> *Subject:* Caching tables in spark
>>
>>
>>
>> Hi,
>>
>>
>>
>> Looking to draw on your knowledge for a question.
>>
>> I have 2 different processes that read from the same raw data table
>> (around 1.5 TB).
>>
>> Is there a way to read this data once, cache it somehow, and use it in
>> both processes?
>>
>>
>>
>>
>>
>> Thanks
>>
>> --
>>
>> *Tzahi File*
>> Data Engineer
>

-- 
Tzahi File
Data Engineer
email tzahi.f...@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com


Re: Caching tables in spark

2019-08-28 Thread Subash Prabakar
By process, do you mean two separate Spark jobs? Or two stages within the
same Spark code?
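
To make the second case concrete, a minimal sketch of in-application caching
(placeholder table, column, and path names; not taken from the thread):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-cache-sketch").getOrCreate()

    # Placeholder table: read the raw data once and persist it.
    raw = spark.table("raw_data_table")
    raw.persist(StorageLevel.MEMORY_AND_DISK)   # spills to disk if it won't fit in memory

    # Both computations reuse the persisted copy instead of rescanning the source.
    daily_counts = raw.groupBy("event_date").count()
    by_country = raw.groupBy("country").count()

    daily_counts.write.mode("overwrite").parquet("/tmp/daily_counts")
    by_country.write.mode("overwrite").parquet("/tmp/by_country")

    raw.unpersist()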

Thanks
Subash

On Wed, 28 Aug 2019 at 19:06,  wrote:

> Take a look at this article
>
>
>
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html
>
>
>
> *From:* Tzahi File 
> *Sent:* Wednesday, August 28, 2019 5:18 AM
> *To:* user 
> *Subject:* Caching tables in spark
>
>
>
> Hi,
>
>
>
> Looking to draw on your knowledge for a question.
>
> I have 2 different processes that read from the same raw data table
> (around 1.5 TB).
>
> Is there a way to read this data once, cache it somehow, and use it in
> both processes?
>
>
>
>
>
> Thanks
>
> --
>
> *Tzahi File*
> Data Engineer


RE: Caching tables in spark

2019-08-28 Thread email
Take a look at this article 

 

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html

 

From: Tzahi File  
Sent: Wednesday, August 28, 2019 5:18 AM
To: user 
Subject: Caching tables in spark

 

Hi, 

 

Looking to draw on your knowledge for a question.

I have 2 different processes that read from the same raw data table (around
1.5 TB).

Is there a way to read this data once, cache it somehow, and use it in both
processes?

 

 

Thanks

-- 
Tzahi File
Data Engineer
email tzahi.f...@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com



Caching tables in spark

2019-08-28 Thread Tzahi File
Hi,

Looking to draw on your knowledge for a question.
I have 2 different processes that read from the same raw data table (around
1.5 TB).
Is there a way to read this data once, cache it somehow, and use it in both
processes?


Thanks
-- 
Tzahi File
Data Engineer
email tzahi.f...@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com


What is the directory "/path/_spark_metadata" for?

2019-08-28 Thread Mark Zhao
 Hey,

 When running Spark on Alluxio 1.8.2, I encountered the following exception
in the Alluxio master.log: "alluxio.exception.FileDoesNotExistException: Path
/test-data/_spark_metadata does not exist". What exactly is the
"_spark_metadata" directory used for, and how can I fix this problem?

Thanks.

Mark


Low cache hit ratio when running Spark on Alluxio

2019-08-28 Thread Jerry Yan
 Hi,

We are running Spark jobs against an Alluxio cluster that serves 13
gigabytes of data, 99% of which is in memory. I was hoping to speed up the
Spark jobs by reading the in-memory data from Alluxio, but I found that the
Alluxio local hit rate is only 1.68%, while the remote hit rate is 98.32%.
By monitoring network IO across all worker nodes with the "dstat" command, I
found that only two nodes sent or received about 1 GB over the whole run, and
that traffic occurred during the Spark shuffle stage. Are there any metrics I
could check or configurations I could tune?
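
In case a concrete example helps, one knob that is sometimes checked in this
situation (offered as an assumption about what might matter, not a diagnosis
of this cluster): whether executors are co-located with the Alluxio workers
and whether the scheduler is given enough time to place tasks node-locally,
which the spark.locality.wait setting controls. The value below is
illustrative only:

    from pyspark.sql import SparkSession

    # A hedged sketch: give the scheduler more time to obtain node-local
    # placement before falling back to rack-local/any, which can raise the
    # local hit rate when executors run on the Alluxio worker nodes.
    spark = (SparkSession.builder
             .appName("alluxio-locality-sketch")
             .config("spark.locality.wait", "6s")
             .getOrCreate())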


Best,

Jerry


How to improve loading data into a Cassandra table in this scenario?

2019-08-28 Thread Shyam P
>
> updated the issue content.
>

https://stackoverflow.com/questions/57684972/how-to-improve-performance-my-spark-job-here-to-load-data-into-cassandra-table


Thank you.