Re: spark as data warehouse?

2022-03-25 Thread Deepak Sharma
It can be used as a warehouse, but then you have to keep long-running Spark
jobs. This can be done using cached DataFrames or Datasets.
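
For example, a minimal PySpark sketch of keeping a cached view around in a
long-running session (the "events" table name here is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse").getOrCreate()

# hypothetical source table; cache it so repeated queries are served
# from memory for the lifetime of the session
df = spark.read.table("events")
df.cache()
df.createOrReplaceTempView("events_cached")

spark.sql("SELECT count(*) FROM events_cached").show()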

Thanks
Deepak

On Sat, 26 Mar 2022 at 5:56 AM,  wrote:

> For some time we have been using Hive for building the data warehouse.
> Do you think Spark can be used for this purpose? It's even more real-time
> than Hive.
>
> Thanks.
>
--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


spark as data warehouse?

2022-03-25 Thread capitnfrakass
For some time we have been using Hive for building the data warehouse.
Do you think Spark can be used for this purpose? It's even more real-time
than Hive.


Thanks.




Re: Question for so many SQL tools

2022-03-25 Thread Bjørn Jørgensen
No, they are not all doing the same thing.
But everyone knows SQL; it has been around since the early 1970s.

Apache Drill is aimed at querying NoSQL stores.
Spark is for everything you will do with data.

All of them have their pros and cons. You just have to find what's best for
your task.


On Fri, Mar 25, 2022 at 22:32, Bitfox wrote:

> Just a question: why are there so many SQL-based tools for data jobs?
>
> The ones I know,
>
> Spark
> Flink
> Ignite
> Impala
> Drill
> Hive
> …
>
> They are doing similar jobs, IMO.
> Thanks
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
Yes, MLlib is actively developed. You can have a look at GitHub and filter
the pull requests on closed and the ML label.




On Fri, Mar 25, 2022 at 22:15, Bitfox wrote:

> BTW, is MLlib still in active development?
>
> Thanks
>
> On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:
>
>> GraphX is not active, though still there and does continue to build and
>> test with each Spark release. GraphFrames kind of superseded it, but is
>> also not super active FWIW.
>>
>> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez
>>  wrote:
>>
>>> Hello!
>>>
>>>
>>>
>>> My team and I are evaluating GraphX as a possible solution. Would
>>> someone be able to speak to the support of this Spark feature? Is there
>>> active development or is GraphX in maintenance mode (e.g. updated to ensure
>>> functionality with new Spark releases)?
>>>
>>>
>>>
>>> Thanks in advance for your help!
>>>
>>>
>>>
>>> --
>>>
>>> Jacob H. Marquez
>>>
>>> He/Him
>>>
>>> Data & Applied Scientist
>>>
>>> Microsoft Cloud Data Sciences
>>>
>>>
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Question for so many SQL tools

2022-03-25 Thread Bitfox
Just a question: why are there so many SQL-based tools for data jobs?

The ones I know,

Spark
Flink
Ignite
Impala
Drill
Hive
…

They are doing similar jobs, IMO.
Thanks


Re: GraphX Support

2022-03-25 Thread Bitfox
BTW, is MLlib still in active development?

Thanks

On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:

> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
> wrote:
>
>> Hello!
>>
>>
>>
>> My team and I are evaluating GraphX as a possible solution. Would someone
>> be able to speak to the support of this Spark feature? Is there active
>> development or is GraphX in maintenance mode (e.g. updated to ensure
>> functionality with new Spark releases)?
>>
>>
>>
>> Thanks in advance for your help!
>>
>>
>>
>> --
>>
>> Jacob H. Marquez
>>
>> He/Him
>>
>> Data & Applied Scientist
>>
>> Microsoft Cloud Data Sciences
>>
>>
>>
>


Re: [EXTERNAL] Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
One alternative can be to use Spark together with ArangoDB:

Introducing the new ArangoDB Datasource for Apache Spark



ArangoDB is an open-source graph database with a lot of good graph
utilities and documentation.
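
A minimal PySpark read sketch (assuming an active SparkSession and the
connector on the classpath; the format name and option names here follow
the announcement above and may differ by version, and the endpoint,
database, and collection values are hypothetical):

df = (spark.read
      .format("com.arangodb.spark")
      .option("endpoints", "localhost:8529")  # hypothetical coordinator
      .option("database", "mydb")             # hypothetical database
      .option("table", "myCollection")        # hypothetical collection
      .load())
df.show()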

On Tue, Mar 22, 2022 at 00:49, Jacob Marquez wrote:

> Awesome, thank you!
>
>
>
> *From:* Sean Owen 
> *Sent:* Monday, March 21, 2022 4:11 PM
> *To:* Jacob Marquez 
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] Re: GraphX Support
>
>
>
>
> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
>
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez <
> jac...@microsoft.com.invalid> wrote:
>
> Hello!
>
>
>
> My team and I are evaluating GraphX as a possible solution. Would someone
> be able to speak to the support of this Spark feature? Is there active
> development or is GraphX in maintenance mode (e.g. updated to ensure
> functionality with new Spark releases)?
>
>
>
> Thanks in advance for your help!
>
>
>
> --
>
> Jacob H. Marquez
>
> He/Him
>
> Data & Applied Scientist
>
> Microsoft Cloud Data Sciences
>
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [Spark SQL] Structured Streaming in Python can connect to Cassandra?

2022-03-25 Thread Gourav Sengupta
Hi,

Completely agree with Alex. Also, if you are just writing to Cassandra,
then what is the purpose of writing to a Kafka broker?

Generally people assume that adding more components to their architecture
is great, but sadly it is not. Remove the Kafka broker in case you are not
broadcasting your messages to a wider set of consumers. Also, Spark is
overkill for the way you are using it.

There are fantastic solutions available in the market like Presto SQL,
BigQuery, Redshift, Athena, Snowflake, etc., and Spark is just one of the
tools, and often a difficult one to configure and run.

Regards,
Gourav Sengupta

On Fri, Mar 25, 2022 at 1:19 PM Alex Ott  wrote:

> You don't need to use foreachBatch to write to Cassandra. You just need to
> use Spark Cassandra Connector version 2.5.0 or higher - it supports native
> writing of stream data into Cassandra.
>
> Here is an announcement:
> https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all
>
> guillaume farcy  at "Mon, 21 Mar 2022 16:33:51 +0100" wrote:
>  gf> Hello,
>
>  gf> I am a student and I am currently doing a big data project.
>  gf> Here is my code:
>  gf> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3
>
>  gf> My project is to retrieve messages from a Twitch chat and send them
>  gf> into Kafka; then Spark reads the Kafka topic to perform the
>  gf> processing in the provided gist.
>
>  gf> I want to send these messages into Cassandra.
>
>  gf> I tested a first solution on line 72 which works, but when there are
>  gf> too many messages Spark crashes, probably because my function
>  gf> connects to Cassandra each time it is called.
>
>  gf> I tried the object approach to share the connection object, but
>  gf> without success:
>  gf> _pickle.PicklingError: Could not serialize object: TypeError: cannot
>  gf> pickle '_thread.RLock' object
>
>  gf> Can you please tell me how to do this?
>  gf> Or at least give me some advice?
>
>  gf> Sincerely,
>  gf> FARCY Guillaume.
>
>
>
>
>
>
> --
> With best wishes,Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>
>
>


Re: [Spark SQL] Structured Streaming in Python can connect to Cassandra?

2022-03-25 Thread Alex Ott
You don't need to use foreachBatch to write to Cassandra. You just need to
use Spark Cassandra Connector version 2.5.0 or higher - it supports native
writing of stream data into Cassandra.

Here is an announcement: 
https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all
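
With the connector on the classpath (e.g. --packages
com.datastax.spark:spark-cassandra-connector_2.12:2.5.0), a minimal
PySpark sketch of the native streaming write could look like this (the
keyspace, table, and checkpoint path are hypothetical):

query = (df.writeStream                              # df: a streaming DataFrame
         .format("org.apache.spark.sql.cassandra")
         .option("keyspace", "mykeyspace")           # hypothetical keyspace
         .option("table", "messages")                # hypothetical table
         .option("checkpointLocation", "/tmp/ckpt")  # required for streaming
         .start())
query.awaitTermination()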

guillaume farcy  at "Mon, 21 Mar 2022 16:33:51 +0100" wrote:
 gf> Hello,

 gf> I am a student and I am currently doing a big data project.
 gf> Here is my code:
 gf> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3

 gf> My project is to retrieve messages from a Twitch chat and send them into
 gf> Kafka; then Spark reads the Kafka topic to perform the processing in the
 gf> provided gist.

 gf> I want to send these messages into Cassandra.

 gf> I tested a first solution on line 72 which works, but when there are too
 gf> many messages Spark crashes, probably because my function connects to
 gf> Cassandra each time it is called.

 gf> I tried the object approach to share the connection object, but without
 gf> success:
 gf> _pickle.PicklingError: Could not serialize object: TypeError: cannot pickle
 gf> '_thread.RLock' object

 gf> Can you please tell me how to do this?
 gf> Or at least give me some advice?

 gf> Sincerely,
 gf> FARCY Guillaume.






-- 
With best wishes,Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)




Re: Cannot compare columns directly in IF...ELSE statement

2022-03-25 Thread Balakrishnan Ayyappan
Not sure if I understood the question correctly, but did you try using
`case when`?
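
In PySpark the equivalent is F.when/F.otherwise, since a Column cannot be
used in a plain Python if/else. A minimal sketch with hypothetical column
names:

from pyspark.sql import functions as F

# using a Column in a Python if/else raises "Cannot convert column into
# bool", so the branching has to stay inside the expression
df = df.withColumn(
    "result",
    F.when(F.col("a") > F.col("b"), F.col("a")).otherwise(F.col("b")))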


Thanks,
Bala

On Fri, Mar 25, 2022, 12:44 PM Sid  wrote:

> Hi Team,
>
> I need help with the below problem:
>
>
> https://stackoverflow.com/questions/71613292/how-to-use-columns-in-if-else-condition-in-pyspark
>
> Thanks,
> Sid
>


[ANNOUNCE] Apache Kyuubi (Incubating) released 1.5.0-incubating

2022-03-25 Thread Kent Yao
Hi all,

The Apache Kyuubi (Incubating) community is pleased to announce that
Apache Kyuubi (Incubating) 1.5.0-incubating has been released!

Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for
large-scale data processing and analytics, built on top of Apache Spark
and designed to support more engines like Apache Flink (Beta), Trino (Beta),
and so on.

Kyuubi provides a pure SQL gateway through a Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.
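
Since the gateway speaks the HiveServer2 Thrift protocol, any HS2-compatible
client should work. A minimal sketch using PyHive, assuming a Kyuubi server
on localhost and its default frontend port 10009:

from pyhive import hive

# connect to Kyuubi as if it were a HiveServer2 endpoint
conn = hive.connect(host="localhost", port=10009, username="user")
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())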

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and data lakes.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark at the client-side.

At the server-side, Kyuubi's server and engine multi-tenant architecture
provides administrators a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release notes: https://kyuubi.apache.org/release/1.5.0-incubating.html
Download page: https://kyuubi.apache.org/releases.html

To learn more about Apache Kyuubi (Incubating), please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue Tracker: https://kyuubi.apache.org/issue_tracking.html
- Mailing list: https://kyuubi.apache.org/mailing_lists.html

We would like to thank all contributors of the Kyuubi community and
Incubating
community who made this release possible!

Thanks,
On behalf of Apache Kyuubi (Incubating) community


Cannot compare columns directly in IF...ELSE statement

2022-03-25 Thread Sid
Hi Team,

I need help with the below problem:

https://stackoverflow.com/questions/71613292/how-to-use-columns-in-if-else-condition-in-pyspark

Thanks,
Sid