RE: [EXTERNAL] Re: spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
Hi Christophe, Thank you for the explanation! Regards, Alex From: Christophe Préaud Sent: Wednesday, March 30, 2022 3:43 PM To: Alex Kosberg ; user@spark.apache.org Subject: [EXTERNAL] Re: spark ETL and spark thrift server running together Hi Alex, As stated in the Hive documentation (https

Re: spark ETL and spark thrift server running together

2022-03-30 Thread Christophe Préaud
of Derby may have already booted the > database /tmp/metastore_db. > > I need to be able to run PySpark (Spark ETL) while having spark thrift server > up for BI tool queries. Any workaround for it? > > Thanks!

spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
XSDB6: Another instance of Derby may have already booted the database /tmp/metastore_db. I need to be able to run PySpark (Spark ETL) while having spark thrift server up for BI tool queries. Any workaround for it? Thanks!

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Mich Talebzadeh
Many thanks Marco. Points noted, and other points/criticism are equally welcome. In a forum like this we do not disagree, we just agree to differ, so to speak, and share ideas. I will review my code and take on board your suggestions. Regards, Mich

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Sofia’s World
Hey Mich, my 2 cents on top of Jerry's. For reusable @fixtures across your tests, I'd leverage conftest.py and put all of them there, if the number is not too big. Otherwise, as you say, you can create tests\fixtures and place all of them there. In terms of extractHiveData, for a @fixture it is

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Mich Talebzadeh
Interesting points Jerry. I do not know how much atomising the unit test brings benefit. For example we have @pytest.fixture(scope = "session") def extractHiveData(): # read data through jdbc from Hive spark_session = s.spark_session(ctest['common']['appName']) tableName =

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
Sure, I think it makes sense in many cases to break things up like this. Looking at your other example I'd say that you might want to break up extractHiveData into several fixtures (one for session, one for config, one for the df) because in my experience fixtures like those are reused constantly
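Jerry's suggestion of one fixture each for config, session, and DataFrame could look roughly like the sketch below. The names `etl_config`, `spark_session`, `hive_df`, `qualified_table` and the config keys are all hypothetical, and the DataFrame is read with `SparkSession.table` rather than the JDBC read used in the thread.

```python
import pytest


def qualified_table(config):
    """Build "database.table" from the config mapping (hypothetical keys)."""
    return f"{config['database']}.{config['table']}"


@pytest.fixture(scope="session")
def etl_config():
    # Hypothetical settings; a real suite would likely load these from a file.
    return {"app_name": "etl-tests", "database": "test_db", "table": "sales"}


@pytest.fixture(scope="session")
def spark_session(etl_config):
    # Imported here so that collecting the tests does not require Spark.
    from pyspark.sql import SparkSession

    session = SparkSession.builder.appName(etl_config["app_name"]).getOrCreate()
    yield session
    # Runs once, after the last test that used the session.
    session.stop()


@pytest.fixture(scope="session")
def hive_df(spark_session, etl_config):
    # Assumption: the table is visible in the session's catalog.
    return spark_session.table(qualified_table(etl_config))
```

Placed in conftest.py, each of the three fixtures can be requested on its own, so a test that only needs the config never starts a Spark session.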

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Mich Talebzadeh
Thanks Jerry for your comments. The easiest option, and I concur, is to have all these fixture files, currently under the fixtures package, lumped together in conftest.py under the *tests* package. Then you can get away altogether from fixtures and it works. However, I gather plug and play becomes less

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
Hi Mich, I'm a bit confused by what you mean when you say that you cannot call a fixture in another fixture. The fixtures resolve dependencies among themselves by means of their named parameters. So that means that if I have a fixture @pytest.fixture def fixture1(): return SomeObj() and
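A minimal sketch of the mechanism Jerry describes, where one fixture depends on another through its parameter name. `fixture1` and `SomeObj` come from the message. `fixture2`, `OtherObj` and the test function are hypothetical, since the message is cut off before its second fixture.

```python
import pytest


class SomeObj:
    """Stand-in for whatever the first fixture builds."""

    def __init__(self):
        self.name = "some-obj"


class OtherObj:
    """Hypothetical object that is built from a SomeObj."""

    def __init__(self, inner):
        self.inner = inner


@pytest.fixture
def fixture1():
    return SomeObj()


@pytest.fixture
def fixture2(fixture1):
    # pytest sees the parameter name "fixture1", runs that fixture first,
    # and passes its return value in here.
    return OtherObj(fixture1)


def test_fixture2_receives_fixture1(fixture2):
    # The test only asks for fixture2; fixture1 is resolved transitively.
    assert isinstance(fixture2.inner, SomeObj)
```

The fixtures are never called directly. Naming one as a parameter is what makes pytest call it.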

Testing ETL with Spark using Pytest

2021-02-09 Thread Mich Talebzadeh
I was a bit confused with the use of fixtures in Pytest with the dataframes passed as an input pipeline from one fixture to another. I wrote this after spending some time on it. As usual it is heuristic rather than anything overtly by the book so to speak. In PySpark and PyCharm you can ETTL from

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
Hey, my 2 cents on CI/CD for PySpark. You can leverage pytest + Holden Karau's spark testing libs for CI, thus giving you `almost` the same functionality as Scala. I say almost because in Scala you have nice and descriptive funcspecs. For me the choice is based on expertise. Having worked with teams which

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Hi Wim, I think we are splitting the atom here but my inference to functionality was based on: 1. Spark is written in Scala, so knowing Scala programming language helps coders navigate into the source code, if something does not function as expected. 2. Given the framework using

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
It's really a very big discussion around PySpark vs Scala. I have a little experience with how we can automate CI/CD when it's a JVM-based language. I would like to take this as an opportunity to understand the end-to-end CI/CD flow for PySpark-based ETL pipelines. Could someone please

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argumentation you mention that 'functionality is sacrificed in favour of the availability of resources'. That's where I disagree with you but agree with Sean. That is mostly not true. In your previous posts you also mentioned this . The only reason we sometimes

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Thanks for the feedback Sean. Kind regards, Mich

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills you have' are a valid and important determiner of what tools you pick. I disagree that you just have to pick the optimal tool for everything. Sounds good until that comes in contact with the real world. For Spark, Python vs

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
h may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh > wrote: > >

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
. On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh wrote: > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 buckets into Snowflake with Spark. > > The only reason I think they are choosing Python as opposed to Scala

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
Holy war is a bit dramatic don't you think?  The difference between Scala and Python will always be very relevant when choosing between Spark and Pyspark. I wouldn't call it irrelevant to the original question. br, molotch On Sat, 17 Oct 2020 at 16:57, "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
I'm sorry you were offended. I'm not an expert in Python and I wasn't trying to attack you personally. It's just an opinion about what makes a language better or worse, it's not the single source of truth. You don't have to take offense. In the end it's about context and what you're trying to

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my experience, when performance is super important you’ll end up needing to do some of your work in the JVM, but in many situations what matters most is what your team and company are familiar with and the ecosystem of tooling

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
It seems that this thread has turned into a holy war that has nothing to do with the original question. If so, it’s super disappointing. Sent from iPhone > On 17 Oct 2020, at 15:53, Molotch wrote: > > I would say the pros and cons of Python vs Scala is both down to Spark, the > languages in

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Sasha Kacanski
And you are an expert on python! Idiomatic... Please do everyone a favor and stop commenting on things you have no idea... I build ETL systems python that wiped java commercial stacks left and right. Pyspark was and is and will be a second class citizen in spark world. That has nothing to do with

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala is both down to Spark, the languages in themselves and what kind of data engineer you will get when you try to hire for the different solutions. With Pyspark you get less functionality and increased complexity with the py4j java interop compared

Re: Scala vs Python for ETL with Spark

2020-10-15 Thread Mich Talebzadeh
Python knowledge). However, Spark documents frequently state availability of features to Scala and Java and not Python. Looking around everything written for Spark using Python is a work-around. I am not considering Python for data science as my focus has been on using Python with Spark for ETL, I

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
>>>>>> been >>>>>>>> around a long time (long being relative here). Most people either knew >>>>>>>> UNIX >>>>>>>> Shell, C, Python or Perl or a combination of all these. I recall we >>>>>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Gourav Sengupta
r Hadoop admin for root password to >>>>>>> log in to the edge node. Later he became head of machine learning >>>>>>> somewhere else and he loved C and Python. So Python was a gift in >>>>>>> disguise. >>>>>>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
pinions and not facts so to speak :) >>>>>> >>>>>> Cheers, >>>>>> >>>>>> >>>>>> Mich >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread ayan guha
>>>>> I think Python appeals to those who are very familiar with CLI and shell >>>>> programming (Not GUI fan). As some members alluded to there are more >>>>> people >>>>> around with Python knowledge. Most managers choose Python as the un

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
>>>> seen a manager who feels at home with Scala. So in summary it is a bit >>>> disappointing to abandon Scala and switch to Python just for the sake of >>>> it. >>>> >>>> Disclaimer: These are opinions and not facts so to speak

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
to abandon Scala and switch to Python just for the sake of it. >>> >>> Disclaimer: These are opinions and not facts so to speak :) >>> >>> Cheers, >>> >>> >>> Mich >>> >>> >>> >>> >>> >>>

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
>> >> Disclaimer: These are opinions and not facts so to speak :) >> >> Cheers, >> >> >> Mich >> >> >> >> >> >> >> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh >> wrote: >> >>> I have come across

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
at 21:56, Mich Talebzadeh > wrote: > >> I have come across occasions when the teams use Python with Spark for >> ETL, for example processing data from S3 buckets into Snowflake with Spark. >> >> The only reason I think they are choosing Python as opposed to Scala is >

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
to Python just for the sake of it. Disclaimer: These are opinions and not facts so to speak :) Cheers, Mich On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh wrote: > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 b

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
apply different > standards and criteria. And then it really depends on architecture aspects > etc. > > On 09.10.2020 at 22:57, Mich Talebzadeh wrote: > > > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 buckets

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
on architecture aspects etc. > On 09.10.2020 at 22:57, Mich Talebzadeh wrote: > > > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 buckets into Snowflake with Spark. > > The only reason I think they are choos

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
lambdas you will hit some significant serialization >> penalties as well as have to run actual work code in python. As long as no >> lambdas are used, everything will operate with Catalyst compiled java code >> so there won't be a big difference between python and scala. >> >

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
d >>> be almost no difference between the Scala and Python dataframe code. Once >>> you introduce python lambdas you will hit some significant serialization >>> penalties as well as have to run actual work code in python. As long as no >>> lambdas are use

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
rialization >> penalties as well as have to run actual work code in python. As long as no >> lambdas are used, everything will operate with Catalyst compiled java code >> so there won't be a big difference between python and scala. >> >> On Fri, Oct 9, 2020 a

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
e > so there won't be a big difference between python and scala. > > On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh > wrote: > >> I have come across occasions when the teams use Python with Spark for >> ETL, for example processing data from S3 buckets into Snowflake with

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
as no lambdas are used, everything will operate with Catalyst compiled java code so there won't be a big difference between python and scala. On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh wrote: > I have come across occasions when the teams use Python with Spark for ETL, > for example processing dat
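To make the distinction concrete, here is a sketch of the same transformation written twice: once with a built-in column function and once with a Python function wrapped as a UDF. The helper names (`shout`, `with_builtin`, `with_python_udf`) and the output column name are hypothetical. The PySpark calls (`F.upper`, `F.col`, `F.udf`, `withColumn`) are standard API.

```python
def shout(text):
    """Plain Python logic; Spark cannot see inside this function."""
    return None if text is None else text.upper()


def with_builtin(df, column):
    # Built-in expression: planned by Catalyst and evaluated inside the JVM.
    from pyspark.sql import functions as F

    return df.withColumn("shouted", F.upper(F.col(column)))


def with_python_udf(df, column):
    # Python UDF: rows are serialised to a Python worker process,
    # run through shout(), and the results are sent back to the JVM.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    shout_udf = F.udf(shout, StringType())
    return df.withColumn("shouted", shout_udf(F.col(column)))
```

Both functions produce the same column. Calling `.explain()` on each result is one way to see that only the second plan includes a Python evaluation step.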

Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
I have come across occasions when the teams use Python with Spark for ETL, for example processing data from S3 buckets into Snowflake with Spark. The only reason I think they are choosing Python as opposed to Scala is because they are more familiar with Python. Since Spark is written in Scala

Re: ETL Using Spark

2020-05-24 Thread vijay.bvp
Hi Avadhut Narayan Joshi, The use case is achievable using Spark. Connection to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server. For production workloads, important points to consider: >> what are the QoS requirements for your case? at least

Re: ETL Using Spark

2020-05-21 Thread Mich Talebzadeh
On Thu, 21 May 2020 at 16:15, Avadhut Narayan Joshi wrote: > Hello Team > > I am working on ETL using Spark. > > - I am fetching streaming data from Confluent Kafka > - Wanted to do aggr

ETL Using Spark

2020-05-21 Thread Avadhut Narayan Joshi
Hello Team, I am working on ETL using Spark. * I am fetching streaming data from Confluent Kafka * Wanted to do aggregations by combining streaming data with data from SQL Server. For achieving the above use case: 1. Can I fetch data from SQL Server into Spark based on where
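A rough sketch of the use case, assuming the truncated first question is about filtering on the SQL Server side before the data reaches Spark. All function names, the URL, table and topic are hypothetical. The JDBC `query` option and the Kafka source options are standard Spark options, and credentials are left out for brevity.

```python
def pushdown_query(table, where):
    """Build a SELECT with a WHERE clause to hand to the JDBC source.

    Assumes trusted inputs; nothing is escaped here.
    """
    return f"SELECT * FROM {table} WHERE {where}"


def read_sql_server(spark, jdbc_url, table, where):
    # The "query" option makes SQL Server run the filter, so only matching
    # rows are transferred to Spark.
    return (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("query", pushdown_query(table, where))
        .load()
    )


def read_kafka_stream(spark, bootstrap_servers, topic):
    # Needs the spark-sql-kafka connector package on the classpath.
    # The payload arrives in a binary "value" column that still has to be
    # parsed into named columns before it can be joined.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )


def join_stream_with_static(parsed_stream_df, static_df, key):
    # Stream-static inner join between the parsed stream and the JDBC data.
    return parsed_stream_df.join(static_df, on=key, how="inner")
```

The aggregation would then be applied to the joined DataFrame and written out with `writeStream`.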

Re: Etl with spark

2017-02-12 Thread Sam Elamin
Yup I ended up doing just that thank you both On Sun, 12 Feb 2017 at 18:33, Miguel Morales wrote: > You can parallelize the collection of s3 keys and then pass that to your > map function so that files are read in parallel. > > Sent from my iPhone > > On Feb 12, 2017, at

Re: Etl with spark

2017-02-12 Thread Miguel Morales
You can parallelize the collection of s3 keys and then pass that to your map function so that files are read in parallel. Sent from my iPhone > On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just
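Miguel's suggestion could be sketched as follows. `extract` is a hypothetical placeholder for the real per-key work (fetching and saving the S3 object), and `process_keys` is an illustrative helper name. `sc` is an existing SparkContext.

```python
def extract(key):
    """Hypothetical per-key work; a real job would fetch the S3 object here."""
    return (key, len(key))


def process_keys(sc, keys, num_slices=None):
    # Distribute the in-memory list of keys across the cluster, then run
    # extract() on the executors, one call per key.
    return sc.parallelize(keys, num_slices).map(extract)
```

Nothing runs until an action such as `process_keys(sc, keys).collect()` is called. Raising `num_slices` spreads the keys over more tasks, so more files are read concurrently.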

Re: Etl with spark

2017-02-12 Thread Sam Elamin
Thanks Ayan, but I was hoping to remove the dependency on a file and just use an in-memory list or dictionary. So from the reading I've done today, it seems the concept of a bespoke async method doesn't really apply in Spark, since the cluster deals with distributing the workload. Am I mistaken?

Re: Etl with spark

2017-02-12 Thread ayan guha
You can store the list of keys (I believe you use them in source file path, right?) in a file, one key per line. Then you can read the file using sc.textFile (So you will get a RDD of file paths) and then apply your function as a map. r = sc.textFile(list_file).map(your_function) HTH On Sun,

Etl with spark

2017-02-12 Thread Sam Elamin
Hey folks, really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store. I have to read from a list of keys in S3, but I am doing a raw extract then saving. Only some of the extracts have a simple transformation, but overall the code looks the