
Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
It's really a big discussion, PySpark vs. Scala. I have a little
experience with automating CI/CD when the pipeline is in a JVM-based
language.
I would like to take this as an opportunity to understand the end-to-end
CI/CD flow for PySpark-based ETL pipelines.

Could someone please list the steps for how pipeline automation works
for PySpark-based pipelines in production?
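To frame what I mean by pipeline automation, here is a rough sketch of the
stages I have in mind (GitLab-CI-style syntax; the job names, paths, and the
final submit step are placeholders, not a working config):

```yaml
# Hypothetical CI pipeline for a PySpark ETL repo.
stages: [test, package, deploy]

unit-test:
  stage: test
  script:
    - pip install pyspark pytest
    - pytest tests/            # transformations tested against a local SparkSession

package:
  stage: package
  script:
    - python -m build --wheel  # bundle the job code and its helpers

deploy:
  stage: deploy
  script:
    # ship the wheel to the cluster and submit with --py-files
    - spark-submit --py-files dist/*.whl jobs/main_job.py
```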

//William

On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <
wim.vanleu...@highestpoint.biz> wrote:

> I think Sean is right, but in your argumentation you mention that
> 'functionality is sacrificed in favour of the availability of resources'.
> That's where I disagree with you but agree with Sean. That is mostly not
> true.
>
> In your previous posts you also mentioned this. The only reason we
> sometimes have to bail out to Scala is performance with certain UDFs.
>
> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
> wrote:
>
>> Thanks for the feedback Sean.
>>
>> Kind regards,
>>
>> Mich
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>>
>>> I don't find this trolling; I agree with the observation that 'the
>>> skills you have' are a valid and important determiner of what tools you
>>> pick.
>>> I disagree that you just have to pick the optimal tool for everything.
>>> Sounds good until that comes in contact with the real world.
>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>> you're doing DataFrame operations. By design. So I can't see there being
>>> one answer to this.
>>>
>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> this is turning into a troll now, can you please stop this?
>>>>
>>>> No one uses Scala where Python should be used, and no one uses Python
>>>> where Scala should be used - it all depends on requirements. Everyone
>>>> understands polyglot programming and how to use relevant technologies best
>>>> to their advantage.
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>

-- 
Regards,
William R
+919037075164


HDP 3.1 Spark Kafka dependency

2020-03-18 Thread William R
Hi,

I am having difficulty finding the proper Kafka libraries for Spark. The
HDP version is 3.1, and I tried the libraries below, but they produce the
issue shown.

*POM entry:*

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.0.0.3.1.0.0-78</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>2.0.0.3.1.0.0-78</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.2.3.1.0.0-78</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.2.3.1.0.0-78</version>
</dependency>

*Issue during spark-submit:*

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:639)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
        at com.example.ReadDataFromKafka$.main(ReadDataFromKafka.scala:18)
        at com.example.ReadDataFromKafka.main(ReadDataFromKafka.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)


Could someone help me figure out whether I am doing something wrong?

*Spark Submit:*

export KAFKA_KERBEROS_PARAMS="-Djava.security.auth.login.config=kafka.consumer.properties"
export KAFKA_OPTS="-Djava.security.auth.login.config=kafka.consumer.properties"
export SPARK_KAFKA_VERSION=NONE

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka.consumer.properties" \
  --files "kafka.consumer.properties" \
  --class com.example.ReadDataFromKafka \
  HelloKafka-1.0-SNAPSHOT.jar

*Consumer code:*
https://sparkbyexamples.com/spark/spark-batch-processing-produce-consume-kafka-topic/
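For reference, the ClassNotFoundException above ("kafka.DefaultSource")
usually means the Kafka data-source connector itself is missing from the
POM: the "kafka" format used by spark.read resides in the separate
spark-sql-kafka-0-10 module, not in kafka-clients or spark-streaming. A
sketch of the extra dependency, where the HDP build number is an
assumption mirrored from the spark-core version above:

```xml
<!-- Provides the "kafka" data source for spark.read.format("kafka");
     the version shown is assumed to match the other HDP Spark artifacts. -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.3.2.3.1.0.0-78</version>
</dependency>
```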


Regards,
William R