Re: [Spark] spark client for Hadoop 2.x

2022-04-06 Thread Morven Huang
I remember that ./dev/make-distribution.sh in the Spark source allows people to 
specify the Hadoop version.
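
For example, something along these lines should build a distribution against 
Hadoop 2.7 (a sketch only; the available Hadoop profiles and flags differ 
across Spark 3.x releases, so please check the "Building Spark" docs for your 
release):

./dev/make-distribution.sh --name custom-hadoop2 --tgz \
    -Phadoop-2.7 -Dhadoop.version=2.7.7 -Pyarn -Phive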

> On 6 Apr 2022, at 4:31 PM, Amin Borjian wrote:
> 
> Since Spark 3.1.0, the Spark client artifacts published to the Maven 
> repository are built against Hadoop 3. Unfortunately, we currently use 
> Hadoop 2.7.7 in our infrastructure.
> 
> 1) Does Spark have a plan to publish the Spark client dependencies for Hadoop 
> 2.x?
> 2) Are the new Spark clients capable of connecting to a Hadoop 2.x cluster? 
> (According to a simple test, the Spark 3.2.1 client had no problem with a 
> Hadoop 2.7 cluster, but we wanted to know whether there is any guarantee from 
> Spark.)
>  
> Thank you very much in advance
> Amin Borjian



Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Mich Talebzadeh
Your statement below:


I believe I have found the issue: the job writes data to HBase, which is on
the same cluster.
When I keep on processing data and writing with Spark to HBase, eventually
the garbage collection cannot keep up anymore for HBase, and the HBase
memory consumption increases. As the cluster hosts both HBase and Spark,
this leads to an overall increase, and at some point you hit the limit of
the available memory on each worker.
I don't think the Spark memory is increasing over time.


   1. Is your cluster on-prem? Do you have a Hadoop cluster
   with Spark using the same nodes as HDFS?
   2. Is your HBase clustered or standalone, and has it been created on the
   HDFS nodes?
   3. Are you writing to HBase through Phoenix or straight to HBase?
   4. Where does S3 come into this?


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Bjørn Jørgensen
Great, upgrade from 2.4 to 3.x.
It seems like you can use unpersist after

df=read(fromhdfs)
df2=spark.sql(using df1)
..df10=spark.sql(using df9)
?
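
A minimal sketch of where that could go in the loop body, reusing the names 
from the pseudocode above (df10 cached via CACHE TABLE, then written three 
times); an illustration only, not tested against the actual job:

# after the three writes that reuse the cached df10
spark.sql("UNCACHE TABLE df10")     # counterpart of spark.sql(CACHE TABLE df10)
# or, if the DataFrame was cached with df10.cache():
# df10.unpersist()
# optionally drop everything cached in this SparkSession before the next day:
# spark.catalog.clearCache()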

I did use Kubernetes and Spark with the S3 API (min.io) and it works :)
With Kubernetes you deploy everything with k8s resource limits.
If you have done it this way, then you won't have this problem.



Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Gourav Sengupta
Hi,
super duper.

Please try to see if you can write out the data to S3, and then write a
load script to load that data from S3 to HBase.
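
A minimal sketch of the first half of that idea (the bucket, path, and format 
are placeholders, and the S3A connector and credentials are assumed to be 
configured on the cluster):

# write the day's output to S3 instead of straight to HBase
df13.write.mode("overwrite").parquet("s3a://my-bucket/output/df13/day=2022-04-06")
# a separate job (or HBase bulk-load tooling) can then read these files and
# load them into HBase, decoupling Spark's memory use from HBase's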


Regards,
Gourav Sengupta



Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Joris Billen
Hi,
thanks for your reply.


I believe I have found the issue: the job writes data to HBase, which is on the 
same cluster.
When I keep on processing data and writing with Spark to HBase, eventually the 
garbage collection cannot keep up anymore for HBase, and the HBase memory 
consumption increases. As the cluster hosts both HBase and Spark, this leads 
to an overall increase, and at some point you hit the limit of the available 
memory on each worker.
I don't think the Spark memory is increasing over time.



Here more details:

**Spark: 2.4
**operation: many Spark SQL statements followed by writing data to a NoSQL db 
from Spark, like this:
df=read(fromhdfs)
df2=spark.sql(using df1)
..df10=spark.sql(using df9)
spark.sql(CACHE TABLE df10)
df11=spark.sql(using df10)
df11.write
df12=spark.sql(using df10)
df12.write
df13=spark.sql(using df10)
df13.write
**caching: yes, one df that I will eventually use to write 3x to a db (those 3 
writes are different)
**Loops: since I need to process several years, and processing 1 day is already 
a complex process (40 minutes on a 9-node cluster running quite a few 
executors), in the end everything runs in one go, and there is a limit to how 
much data I can process at once with the available resources.
Some people here pointed out they believe this looping should not be necessary. 
But what is the alternative?
—> Maybe I can write to disk somewhere in the middle and read it back from 
there, so that not everything has to happen in one go in memory.
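
A minimal sketch of that idea, assuming an HDFS scratch path (the path and the 
choice of Parquet are placeholders):

# materialise the expensive intermediate result instead of only caching it
df10.write.mode("overwrite").parquet("hdfs:///tmp/scratch/df10/day=2022-04-06")
# re-read it; the lineage of the earlier transformations is cut here, so the
# three writes below no longer need to recompute or hold df1..df9 in memory
df10 = spark.read.parquet("hdfs:///tmp/scratch/df10/day=2022-04-06")

DataFrame.checkpoint() (together with spark.sparkContext.setCheckpointDir(...)) 
is another way to cut lineage, at the cost of writing to the checkpoint 
directory.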







On 5 Apr 2022, at 14:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

can you please give details around:
the Spark version, what operation you are running, why in loops, whether you 
are caching any data or not, and whether you are referencing the variables to 
create them, like in the following expression where we reference x to create 
x: x = x + 1

Thanks and Regards,
Gourav Sengupta

On Mon, Apr 4, 2022 at 10:51 AM Joris Billen <joris.bil...@bigindustries.be> wrote:
Clear, probably not a good idea.

But a previous comment said “you are doing everything in the end in one go”.
So this made me wonder: if your only action is a write at the end, after lots 
of complex transformations, then what is the alternative to writing at the end, 
which means doing everything all at once? My understanding is that if there is 
no need for an action earlier, you will do everything at the end, which means 
there is a limit to how many days you can process at once. Hence the solution 
is to loop over a couple of days, submitting the same Spark job each time, just 
with different input.


Thanks!

On 1 Apr 2022, at 15:26, Sean Owen <sro...@gmail.com> wrote:

This feels like premature optimization, and it is not clear it's optimizing 
anything, but maybe.
Caching things that are used once is worse than not caching. It looks like a 
straight-line through to the write, so I doubt caching helps anything here.

On Fri, Apr 1, 2022 at 2:49 AM Joris Billen 
mailto:joris.bil...@bigindustries.be>> wrote:
Hi,
as said, thanks for the little discussion over mail.
I understand that the action is triggered at the end, at the write, and then 
everything is all of a sudden executed at once. But I don't really need to 
trigger an action before that. I am caching somewhere a df that will be reused 
several times (slightly updated pseudocode below).

Question: is it then better practice to already trigger some actions on 
intermediate data frames (like df4 and df8), and cache them? So that these 
actions will not be that expensive yet, and the actions to write at the end 
will require fewer resources, which would allow processing more days in one go? 
Like what is added in the improvement section in the pseudo code below?



pseudocode:


loop over all days:
spark submit 1 day



with spark submit (overly simplified)=


  df=spark.read(hdfs://somepath)
  …
   ##IMPROVEMENT START
   df4=spark.sql(some stuff with df3)
   spark.sql(CACHE TABLE df4)
   …
   df8=spark.sql(some stuff with df7)
   spark.sql(CACHE TABLE df8)
  ##IMPROVEMENT END
   ...
   df12=df11.spark.sql(complex stuff)
  spark.sql(CACHE TABLE df12)
   ...
  df13=spark.sql( complex stuff with df12)
  df13.write
  df14=spark.sql( some other complex stuff with df12)
  df14.write
  df15=spark.sql( some completely other complex stuff with df12)
  df15.write
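
(For reference, a sketch of how the cache in the improvement section could be 
made to materialise eagerly, purely to illustrate the mechanism being asked 
about; as noted elsewhere in the thread, caching intermediates that are only 
used once may not actually help:)

  df4 = spark.sql("some stuff with df3")   # as in the pseudocode above
  df4.cache()        # mark df4 for caching (lazy until an action runs)
  df4.count()        # cheap action that computes df4 and fills the cache now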






Thanks!



On 31 Mar 2022, at 14:37, Sean Owen <sro...@gmail.com> wrote:

If that is your loop unrolled, then you are not doing parts of work at a time. 
That will execute all operations in one go when the write finally happens. 
That's OK, but may be part of the problem. For example if you are filtering for 
a subset, processing, and unioning, then that is just a harder and slower way 
of applying the transformation to all data at once.

On Thu, Mar 31, 2022 at 3:30 AM Joris Billen 
mailto:joris.bil...@bigindustries.be>> wrote:
Thanks for the reply :-)

I am using pyspark. Basically my code 

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Enrico Minack

Another project implementing DataSource V2 in Scala with a Python wrapper:

https://github.com/G-Research/spark-dgraph-connector

Cheers,
Enrico




Re: protobuf data as input to spark streaming

2022-04-06 Thread Kiran Biswal
Hello Stelios

Preferred language would have been Scala or PySpark, but if Java is proven I
am open to using it.

Any sample reference or example code link?

How are you handling the protobuf to Spark dataframe conversion
(serialization/deserialization)?

Thanks
Kiran

On Wed, Apr 6, 2022, 2:38 PM Stelios Philippou  wrote:

> Yes we are currently using it as such.
> Code is in java. Will that work?
>
> On Wed, 6 Apr 2022 at 00:51, Kiran Biswal  wrote:
>
>> Hello Experts
>>
>> Has anyone used protobuf (proto3) encoded data (from Kafka) as an input
>> source and been able to do Spark Structured Streaming?
>>
>> I would appreciate if you can share any sample code/example
>>
>> Regards
>> Kiran
>>
>>>
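
For reference, a minimal PySpark sketch of one possible way to consume 
proto3-encoded Kafka messages with Structured Streaming. The module and class 
event_pb2 / Event and the fields id / ts are hypothetical placeholders for 
whatever protoc generates from your .proto file:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType, StringType, StructField, StructType

import event_pb2  # hypothetical module generated by protoc

spark = SparkSession.builder.appName("proto-stream").getOrCreate()

# Spark-side schema matching the (assumed) protobuf message fields
event_schema = StructType([
    StructField("id", StringType()),
    StructField("ts", LongType()),
])

def parse_event(raw):
    # Kafka's value column arrives as binary; parse it with the generated class
    msg = event_pb2.Event()
    msg.ParseFromString(bytes(raw))
    return (msg.id, msg.ts)

parse_event_udf = udf(parse_event, event_schema)

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(parse_event_udf(col("value")).alias("event"))
          .select("event.*"))

query = events.writeStream.format("console").start()
query.awaitTermination()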


Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Cheng Pan
There are some projects based on Spark DataSource V2 that I hope will help you.

https://github.com/datastax/spark-cassandra-connector
https://github.com/housepower/spark-clickhouse-connector
https://github.com/oracle/spark-oracle
https://github.com/pingcap/tispark

Thanks,
Cheng Pan




Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Dyanesh Varun
Thanks a lot!



Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread daniel queiroz
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html

https://developer.here.com/documentation/java-scala-dev/dev_guide/spark-connector/index.html

Thanks,

Daniel Queiroz
81 996289671




Re: protobuf data as input to spark streaming

2022-04-06 Thread Stelios Philippou
Yes, we are currently using it as such.
The code is in Java. Will that work?

On Wed, 6 Apr 2022 at 00:51, Kiran Biswal  wrote:

> Hello Experts
>
> Has anyone used protobuf (proto3) encoded data (from Kafka) as an input
> source and been able to do Spark Structured Streaming?
>
> I would appreciate if you can share any sample code/example
>
> Regards
> Kiran
>
>>


[Spark] spark client for Hadoop 2.x

2022-04-06 Thread Amin Borjian
Since Spark 3.1.0, the Spark client artifacts published to the Maven repository 
are built against Hadoop 3. Unfortunately, we currently use Hadoop 2.7.7 in our 
infrastructure.

1) Does Spark have a plan to publish the Spark client dependencies for Hadoop 
2.x?
2) Are the new Spark clients capable of connecting to a Hadoop 2.x cluster? 
(According to a simple test, the Spark 3.2.1 client had no problem with a 
Hadoop 2.7 cluster, but we wanted to know whether there is any guarantee from 
Spark.)

Thank you very much in advance
Amin Borjian


Writing Custom Spark Readers and Writers

2022-04-06 Thread Dyanesh Varun
Hey team,

Can you please share some documentation/blogs where we can learn how to write
custom sources and sinks for both streaming and static datasets?

Thanks in advance
Dyanesh Varun