KafkaUtils module not found on spark 3 pyspark

2021-02-16 Thread aupres
I use Hadoop 3.3.0 and Spark 3.0.1-bin-hadoop3.2, and my Python IDE is
Eclipse version 2020-12. I am trying to develop a Python application with the
KafkaUtils PySpark module; I configured PySpark and Eclipse following an
online guide. Simple code like the one below works without exceptions.


from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Kafka2RDD").setMaster("local[*]")
sc = SparkContext(conf = conf)
data = [1, 2, 3, 4, 5, 6]
distData = sc.parallelize(data)

print(distData.count())


But I found that the Spark 3 PySpark module does not contain KafkaUtils at
all. The imports below fail:


from pyspark.streaming.kafka import KafkaUtils 
from pyspark.streaming.kafka import OffsetRange


So I downgraded Spark from 3.0.1-bin-hadoop3.2 to 2.4.7-bin-hadoop2.7. Then I
could successfully import KafkaUtils in the Eclipse IDE, but this time
exceptions related to the Spark version were thrown continuously:


Traceback (most recent call last):
  File
"/home/jhwang/eclipse-workspace/BigData_Etl_Python/com/aaa/etl/kafka_spark_rdd.py",
line 36, in <module>
print(distData.count())
  File "/usr/local/spark/python/pyspark/rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark/python/pyspark/rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File
"/usr/python/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py",
line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/python/anaconda3/lib/python3.7/site-packages/py4j/protocol.py",
line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version
55
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)


How on earth can I import KafkaUtils and related modules on Spark 3.0.1?
Where is the KafkaUtils module in PySpark for Spark 3.0.1, or how can it be
installed? Any reply will be welcome. Best regards.
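
For context, the DStream-based pyspark.streaming.kafka module (KafkaUtils,
OffsetRange) was removed in Spark 3.0, so it is not missing from your install;
and the "Unsupported class file major version 55" error after the downgrade
suggests Spark 2.4.7 was run on Java 11, which Spark 2.4.x does not support
(it needs Java 8). In Spark 3, Kafka is read through Structured Streaming
instead. A minimal sketch, assuming the spark-sql-kafka-0-10 package matching
the Spark/Scala build is available and using a placeholder broker address and
topic name:


from pyspark.sql import SparkSession

# Sketch only: the broker address and topic name below are placeholders.
spark = SparkSession.builder \
    .appName("Kafka2DF") \
    .master("local[*]") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()

# Batch read of a Kafka topic; use spark.readStream for a streaming query.
df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test-topic") \
    .load()

# Kafka keys/values arrive as binary columns; cast them to strings before use.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
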






Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-16 Thread galliaproject
I posted a quick update on the Scala mailing list, which mostly discusses
Scala 2.13 support, additional examples, and licensing.






Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread HARSH TAKKAR
Hello Sean,

Thanks for the advice. Can you please point me to an example of a custom
Python wrapper?


Kind Regards
Harsh Takkar

On Tue, 16 Feb, 2021, 8:25 pm Sean Owen,  wrote:

> You won't be able to use it in Python if it is implemented in Java; it
> needs a Python wrapper too.
>
> On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR  wrote:
>
>> Hi,
>>
>> I have created a custom Estimator in Scala, which I can use successfully
>> by creating a pipeline model in Java and Scala. But when I try to load the
>> pipeline model saved via the Scala API in PySpark, I get an error saying
>> the module was not found.
>>
>> I have included my custom model jar in the class path using "spark.jars".
>>
>> Can you please help if I am missing something?
>>
>> Kind Regards
>> Harsh Takkar
>>
>
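
A sketch of what such a wrapper can look like, using PySpark's
JavaEstimator/JavaModel wrapper classes. The Scala class name
com.example.ml.MyCustomEstimator and the params are placeholders, and the
underscore-prefixed helpers are PySpark's internal wrapper machinery, so treat
this as a starting point rather than a stable API:


from pyspark import keyword_only
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaEstimator, JavaModel


class MyCustomModel(JavaModel, JavaMLReadable, JavaMLWritable):
    """Python handle around the JVM model produced by the Scala estimator."""
    pass


class MyCustomEstimator(JavaEstimator, HasInputCol, HasOutputCol,
                        JavaMLReadable, JavaMLWritable):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(MyCustomEstimator, self).__init__()
        # Fully qualified name of the Scala estimator class (placeholder).
        self._java_obj = self._new_java_obj(
            "com.example.ml.MyCustomEstimator", self.uid)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _create_model(self, java_model):
        # Called by fit() to wrap the JVM model in the Python class above.
        return MyCustomModel(java_model)


Note that when a pipeline saved from Scala is loaded in PySpark, the Python
class to import is derived from the JVM class name (only the org.apache.spark
prefix is rewritten to pyspark), so a "module not found" error on load usually
means the wrapper is not importable under that derived module path.
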


Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread Sean Owen
You won't be able to use it in Python if it is implemented in Java; it needs
a Python wrapper too.

On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR  wrote:

> Hi,
>
> I have created a custom Estimator in Scala, which I can use successfully
> by creating a pipeline model in Java and Scala. But when I try to load the
> pipeline model saved via the Scala API in PySpark, I get an error saying
> the module was not found.
>
> I have included my custom model jar in the class path using "spark.jars".
>
> Can you please help if I am missing something?
>
> Kind Regards
> Harsh Takkar
>


Re: vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Sean Owen
You probably don't want swapping in any environment. Some tasks will grind
to a halt under memory pressure rather than just fail quickly. You would want
to simply provision more memory.

On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi  wrote:

> Hi,
>
> We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and use Spark
> both as a standalone deployment on virtual machines/bare metal and as a
> Kubernetes deployment.
>
> There is a kernel parameter named 'vm.swappiness', and we keep its value at
> '1' in the standalone deployment. Now that we are moving to Kubernetes, the
> value of this parameter on the Kubernetes worker nodes is '60'.
>
> My question is whether it is OK to keep such a high value,
> 'vm.swappiness'=60, in a Kubernetes environment for Spark workloads.
>
> Will such a high value of this kernel parameter have a performance impact
> on Spark pods? As per the Cloudera link below, they suggest not setting
> such a high value.
>
>
> https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html
>
> Any thoughts/suggestions on this are highly appreciated.
>
> Regards
> Jahar Tyagi
>
>


Using DataFrame to Read Avro files

2021-02-16 Thread VenkateshDurai
While using Spark 2.4.7 to read an Avro file, I get the error below.
java.lang.NoClassDefFoundError:
org/apache/spark/sql/connector/catalog/TableProvider
  at java.lang.ClassLoader.defineClass1(Native Method)

Code
val dataDF = spark.read.format("avro").load("E:\\Avro\\12.avro")

Please advise on the prerequisite configuration.
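
For what it's worth, org.apache.spark.sql.connector.catalog.TableProvider only
exists in Spark 3.x, so this NoClassDefFoundError usually means a spark-avro
build for Spark 3 was put on a Spark 2.4.7 classpath; the external spark-avro
package has to match the Spark and Scala versions. A minimal PySpark sketch
under that assumption (the original snippet is Scala, and the 2.11 coordinates
assume a default Scala 2.11 build of Spark 2.4.7):


from pyspark.sql import SparkSession

# Sketch only: the package coordinates are an assumption; use
# spark-avro_2.12 if Spark was built with Scala 2.12.
spark = SparkSession.builder \
    .appName("avro-read") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-avro_2.11:2.4.7") \
    .getOrCreate()

# Path taken from the thread.
data_df = spark.read.format("avro").load("E:\\Avro\\12.avro")
data_df.printSchema()
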







vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Jahar Tyagi
Hi,

We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and use Spark both
as a standalone deployment on virtual machines/bare metal and as a Kubernetes
deployment.

There is a kernel parameter named 'vm.swappiness', and we keep its value at
'1' in the standalone deployment. Now that we are moving to Kubernetes, the
value of this parameter on the Kubernetes worker nodes is '60'.

My question is whether it is OK to keep such a high value,
'vm.swappiness'=60, in a Kubernetes environment for Spark workloads.

Will such a high value of this kernel parameter have a performance impact on
Spark pods? As per the Cloudera link below, they suggest not setting such a
high value.

https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html

Any thoughts/suggestions on this are highly appreciated.

Regards
Jahar Tyagi


Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread Mich Talebzadeh
Hi,

Specifically, is this a runtime or a compilation error?

I gather that by class path you mean something like the below:

spark-submit --master yarn --deploy-mode client --driver-class-path <driver-jar>
  --jars <jar1>,<jar2>,..

HTH





LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 16 Feb 2021 at 05:23, HARSH TAKKAR  wrote:

> Hi,
>
> I have created a custom Estimator in Scala, which I can use successfully
> by creating a pipeline model in Java and Scala. But when I try to load the
> pipeline model saved via the Scala API in PySpark, I get an error saying
> the module was not found.
>
> I have included my custom model jar in the class path using "spark.jars".
>
> Can you please help if I am missing something?
>
> Kind Regards
> Harsh Takkar
>