Re: Scala vs Python for ETL with Spark
It's really a big discussion, PySpark vs Scala. I have a little experience with automating CI/CD when a JVM-based language is involved, and I would like to take this as an opportunity to understand the end-to-end CI/CD flow for PySpark-based ETL pipelines. Could someone please list the steps of how pipeline automation works for PySpark-based pipelines in production?

//William

On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <wim.vanleu...@highestpoint.biz> wrote:

> I think Sean is right, but in your argumentation you mention that
> 'functionality is sacrificed in favour of the availability of resources'.
> That's where I disagree with you but agree with Sean. That is mostly not
> true.
>
> You also mentioned this in your previous posts. The only reason we
> sometimes have to bail out to Scala is for performance with certain UDFs.
>
> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh wrote:
>
>> Thanks for the feedback, Sean.
>>
>> Kind regards,
>>
>> Mich
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Thu, 22 Oct 2020 at 20:34, Sean Owen wrote:
>>
>>> I don't find this trolling; I agree with the observation that 'the
>>> skills you have' are a valid and important determiner of what tools you
>>> pick. I disagree that you just have to pick the optimal tool for
>>> everything. That sounds good until it comes into contact with the real
>>> world.
>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>> you're doing DataFrame operations. By design. So I can't see there being
>>> one answer to this.
>>>
>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> This is turning into a troll now; can you please stop?
>>>>
>>>> No one uses Scala where Python should be used, and no one uses Python
>>>> where Scala should be used; it all depends on requirements. Everyone
>>>> understands polyglot programming and how to use the relevant
>>>> technologies to their best advantage.
>>>>
>>>> Regards,
>>>> Gourav Sengupta

--
Regards,
William R
+919037075164
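[Editor's note on the CI/CD question above: a common PySpark CI/CD flow is roughly lint and unit-test the transformation logic, package the job as a wheel or zip, then deploy and spark-submit it. The stage that trips people up is unit testing without a cluster; one widely used pattern is to keep the business logic in plain Python functions the CI runner can test directly. A minimal sketch, with illustrative function names not taken from the thread:]

```python
# CI-friendly PySpark ETL layout (sketch): keep the per-record business
# logic in plain functions so the CI server can test it without Spark.

def normalize_record(record: dict) -> dict:
    """Pure transformation applied per record; easy to unit-test in CI."""
    return {
        "id": int(record["id"]),
        "name": record["name"].strip().lower(),
    }

def run_tests() -> None:
    # In a real pipeline this would live in a pytest module run by CI.
    assert normalize_record({"id": "7", "name": "  Alice "}) == {
        "id": 7,
        "name": "alice",
    }

if __name__ == "__main__":
    run_tests()
    print("ok")
```

[In the Spark job itself the same function would be applied via a UDF or `rdd.map`; after tests pass, CI packages the module and ships it at submit time with `spark-submit --py-files`.]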
HDP 3.1 spark Kafka dependency
Hi,

I am having difficulty getting the proper Kafka libraries for Spark. The HDP version is 3.1; I tried the libraries below, but I get the issue that follows.

*POM entry:*

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0.3.1.0.0-78</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>2.0.0.3.1.0.0-78</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.compat.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.2.3.1.0.0-78</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.2.3.1.0.0-78</version>
    </dependency>

*Issue during spark-submit:*

    Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:639)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
        at com.example.ReadDataFromKafka$.main(ReadDataFromKafka.scala:18)
        at com.example.ReadDataFromKafka.main(ReadDataFromKafka.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

Could someone help me if I am doing something wrong?
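[Editor's note: `ClassNotFoundException: kafka.DefaultSource` is what Spark throws when the Structured Streaming Kafka source is not on the classpath; `kafka-clients` and `kafka_2.11` alone do not provide it. On Spark 2.3.x the `kafka` data source lives in the `spark-sql-kafka-0-10` artifact, so a dependency along these lines would likely resolve it (the HDP version suffix is assumed to match the other Spark artifacts in the POM):]

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.3.2.3.1.0.0-78</version>
</dependency>
```

[If the Spark artifacts are marked `provided` and the application jar is not a fat jar, this jar also has to be supplied at submit time, e.g. via `--packages` or `--jars`.]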
*Spark submit:*

    export KAFKA_KERBEROS_PARAMS="-Djava.security.auth.login.config=kafka.consumer.properties"
    export KAFKA_OPTS="-Djava.security.auth.login.config=kafka.consumer.properties"
    export SPARK_KAFKA_VERSION=NONE

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka.consumer.properties" \
      --files "kafka.consumer.properties" \
      --class com.example.ReadDataFromKafka \
      HelloKafka-1.0-SNAPSHOT.jar

*Consumer code:*
https://sparkbyexamples.com/spark/spark-batch-processing-produce-consume-kafka-topic/

Regards,
William R
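[Editor's note: instead of bundling the Kafka source into the application jar, the submit command can pull it at launch time. A sketch, assuming the HDP repository is reachable from the cluster and the coordinates mirror the Spark versions already in the POM:]

```shell
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2.3.1.0.0-78 \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka.consumer.properties" \
  --files "kafka.consumer.properties" \
  --class com.example.ReadDataFromKafka \
  HelloKafka-1.0-SNAPSHOT.jar
```

[Note that the original command also set the JAAS option as `-Djava.security.auth.login.conf=...`; the property Java actually reads is `java.security.auth.login.config`, as used in the exports.]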