Re: [Spark] spark client for Hadoop 2.x

2022-04-06 Thread Morven Huang
I remember that ./dev/make-distribution.sh in the Spark source tree allows people to specify the Hadoop version. > On April 6, 2022 at 4:31 PM, Amin Borjian wrote: > > From Spark version 3.1.0 onwards, the clients provided for Spark are built > with Hadoop 3 and placed in the Maven repository. Unfortunately we use Hadoop
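A minimal sketch of such a build invocation, assuming the hadoop-2.7 profile shipped with the Spark 3.x sources and Hadoop 2.7.7 as the target version (the distribution name is a placeholder):

    ./dev/make-distribution.sh --name hadoop2-custom --tgz \
      -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7

With --tgz this drops a spark-*-bin-hadoop2-custom.tgz in the source root, built against the requested Hadoop version.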

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Mich Talebzadeh
Your statement below: I believe I have found the issue: the job writes data to HBase, which is on the same cluster. When I keep processing data and writing to HBase with Spark, eventually garbage collection cannot keep up for HBase, and HBase memory consumption increases. As

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Bjørn Jørgensen
Great, upgrade from 2.4 to 3.x. It seems like you can use unpersist after df=read(fromhdfs), df2=spark.sql(using df1), ... df10=spark.sql(using df9)? I did use Kubernetes and Spark with the S3 API
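A minimal Scala sketch of that unpersist pattern, releasing each cached intermediate DataFrame once its successor is materialized (the input path, view names, and queries are illustrative, not from the thread):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object UnpersistChain {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("unpersist-chain").getOrCreate()

        // Hypothetical HDFS input.
        var df: DataFrame = spark.read.parquet("hdfs:///data/input").cache()
        df.createOrReplaceTempView("step0")

        for (i <- 1 to 10) {
          // Placeholder query standing in for the real df2..df10 logic.
          val next = spark.sql(s"SELECT * FROM step${i - 1}").cache()
          next.createOrReplaceTempView(s"step$i")
          next.count()   // materialize before dropping the parent's cache
          df.unpersist() // free the predecessor's blocks on the executors
          df = next
        }

        df.write.mode("overwrite").parquet("hdfs:///data/output")
        spark.stop()
      }
    }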

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Gourav Sengupta
Hi, super duper. Please try to see if you can write out the data to S3, and then write a load script to load that data from S3 to HBase. Regards, Gourav Sengupta On Wed, Apr 6, 2022 at 4:39 PM Joris Billen wrote: > Hi, > thanks for your reply. > > > I believe I have found the issue: the job
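On the Spark side, the first half of that decoupling is just a write to an S3 path; a sketch, assuming s3a credentials are already configured and using placeholder paths (the HBase load then happens out of band):

    import org.apache.spark.sql.SparkSession

    object StageToS3 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("stage-to-s3").getOrCreate()

        // Placeholder input standing in for the job's processed output.
        val result = spark.read.parquet("hdfs:///data/processed")

        // Stage on S3 instead of writing straight to HBase, so a separate
        // loader can move the data into HBase at its own pace.
        result.write
          .mode("overwrite")
          .parquet("s3a://my-bucket/staging/run-2022-04-06/")

        spark.stop()
      }
    }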

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Joris Billen
Hi, thanks for your reply. I believe I have found the issue: the job writes data to HBase, which is on the same cluster. When I keep processing data and writing to HBase with Spark, eventually garbage collection cannot keep up for HBase, and HBase memory consumption

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Enrico Minack
Another project implementing DataSource V2 in Scala with a Python wrapper: https://github.com/G-Research/spark-dgraph-connector Cheers, Enrico On 06.04.22 at 12:01, Cheng Pan wrote: There are some projects based on Spark DataSource V2 that I hope will help you.

Re: protobuf data as input to spark streaming

2022-04-06 Thread Kiran Biswal
Hello Stelios, the preferred language would have been Scala or PySpark, but if Java is proven I am open to using it. Any sample reference or example code link? How are you handling the protobuf to Spark DataFrame conversion (serialization/deserialization)? Thanks, Kiran On Wed, Apr 6, 2022, 2:38 PM
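For the conversion itself, a common approach on the Spark versions of this era (built-in from_protobuf support only arrived later, in Spark 3.4) is to map over the Kafka payload bytes and call parseFrom on the protoc-generated class. A Scala sketch; SensorReading, the broker address, and the topic are all hypothetical:

    import org.apache.spark.sql.SparkSession
    // Hypothetical class generated by protoc from, e.g.:
    //   message SensorReading { string id = 1; double value = 2; }
    // import com.example.protos.SensorReading

    case class Reading(id: String, value: Double)

    object ProtoStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("proto-stream").getOrCreate()
        import spark.implicits._

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "readings")                  // placeholder
          .load()

        // Deserialize each record's value bytes with the generated parser,
        // then lift the fields into a case class so Spark infers a schema.
        val parsed = raw.select("value").as[Array[Byte]].map { bytes =>
          val msg = SensorReading.parseFrom(bytes)
          Reading(msg.getId, msg.getValue)
        }

        parsed.writeStream.format("console").start().awaitTermination()
      }
    }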

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Cheng Pan
There are some projects based on Spark DataSource V2 that I hope will help you. https://github.com/datastax/spark-cassandra-connector https://github.com/housepower/spark-clickhouse-connector https://github.com/oracle/spark-oracle https://github.com/pingcap/tispark Thanks, Cheng Pan On Wed, Apr

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Dyanesh Varun
Thanks a lot! On Wed, 6 Apr, 2022, 15:21 daniel queiroz wrote: > > https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html > > https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html > > >

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread daniel queiroz
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html https://developer.here.com/documentation/java-scala-dev/dev_guide/spark-connector/index.html Thanks, Daniel
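To make the shape of those read-side interfaces concrete, a minimal batch-only DataSource V2 sketch in Scala: a toy provider that emits the integers 0..9 from a single partition (all class names are illustrative, and options/schema handling is pared down):

    import java.util
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Entry point: spark.read.format("<fully qualified name of RangeSource>").load()
    class RangeSource extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = RangeTable.schema
      override def getTable(schema: StructType, partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table = new RangeTable
    }

    object RangeTable {
      val schema: StructType = StructType(Seq(StructField("value", IntegerType)))
    }

    class RangeTable extends Table with SupportsRead {
      override def name(): String = "range_table"
      override def schema(): StructType = RangeTable.schema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
        new ScanBuilder { override def build(): Scan = new RangeScan }
    }

    // One InputPartition per task; kept top-level so it serializes cleanly.
    class RangePartition extends InputPartition

    class RangeScan extends Scan with Batch {
      override def readSchema(): StructType = RangeTable.schema
      override def toBatch: Batch = this
      override def planInputPartitions(): Array[InputPartition] = Array(new RangePartition)
      override def createReaderFactory(): PartitionReaderFactory = new RangeReaderFactory
    }

    class RangeReaderFactory extends PartitionReaderFactory {
      override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
        new PartitionReader[InternalRow] {
          private var i = -1
          override def next(): Boolean = { i += 1; i < 10 }
          override def get(): InternalRow = InternalRow(i)
          override def close(): Unit = ()
        }
    }

A write path follows the same pattern through the connector/write interfaces linked above (WriteBuilder, BatchWrite, DataWriterFactory).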

Re: protobuf data as input to spark streaming

2022-04-06 Thread Stelios Philippou
Yes, we are currently using it as such. The code is in Java. Will that work? On Wed, 6 Apr 2022 at 00:51, Kiran Biswal wrote: > Hello Experts > > Has anyone used protobuf (proto3) encoded data (from Kafka) as an input > source and been able to do Spark Structured Streaming? > > I would appreciate if

[Spark] spark client for Hadoop 2.x

2022-04-06 Thread Amin Borjian
From Spark version 3.1.0 onwards, the clients provided for Spark are built with Hadoop 3 and placed in the Maven repository. Unfortunately we use Hadoop 2.7.7 in our infrastructure currently. 1) Does Spark have a plan to publish the Spark client dependencies for Hadoop 2.x? 2) Are the new

Writing Custom Spark Readers and Writers

2022-04-06 Thread Dyanesh Varun
Hey team, Can you please share some documentation/blogs where we can learn how to write custom sources and sinks for both streaming and static datasets? Thanks in advance, Dyanesh Varun