Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
Hi Lingzhe Sun,

Thanks for your comments. I am afraid I won't be able to take part in this
project or contribute to it.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 02:55, Lingzhe Sun  wrote:

> Hi Mich,
>
> FYI we've been using the spark operator (
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to run
> stateful structured streaming on k8s for a year. We haven't tested it the
> non-operator way.
>
> Besides that, the main contributor of the spark operator, Yinan Li, has
> been inactive for quite a long time. We're somewhat worried that this
> project might eventually become outdated as k8s evolves. So if anyone is
> interested, please support the project.
>
> --
> Lingzhe Sun
> Hirain Technologies

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲 (Lingzhe Sun)
Hi Rajesh,


It's working fine, at least for now. But you'll need to build your own spark 
image using later versions.


Lingzhe Sun
Hirain Technologies

 







Original:
From: Rajesh Katkar
Date: 2023-04-12 21:36:52
To: Lingzhe Sun
Cc: Mich Talebzadeh, user
Subject: Re: Re: spark streaming and kinesis integration

Hi Lingzhe,

We have also started using this operator.
Do you see any issues with it?





Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
unsubscribe


Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
Hi Lingzhe,

We have also started using this operator.
Do you see any issues with it?



Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
Hi Mich,

FYI we've been using the spark operator
(https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to run
stateful structured streaming on k8s for a year. We haven't tested it the
non-operator way.

Besides that, the main contributor of the spark operator, Yinan Li, has been
inactive for quite a long time. We're somewhat worried that this project
might eventually become outdated as k8s evolves. So if anyone is interested,
please support the project.



Lingzhe Sun
Hirain Technologies
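
For anyone evaluating the operator route: a structured streaming job is
deployed by applying a SparkApplication custom resource. Below is a minimal
sketch of such a manifest, built here as a Python dict for illustration (it
would normally be a YAML file). The app name, image, and file path are
hypothetical placeholders, and the exact fields should be verified against
the operator's v1beta2 CRD.

```python
# Minimal sketch of a SparkApplication manifest for spark-on-k8s-operator,
# expressed as a Python dict for illustration (normally written as YAML).
# App name, image, and file path are hypothetical placeholders.
def spark_application(name, image, main_file):
    """Build a SparkApplication custom-resource manifest (v1beta2 CRD)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.3.0",
            # Restart the driver on failure so a long-running streaming
            # query recovers automatically.
            "restartPolicy": {"type": "OnFailure", "onFailureRetries": 3},
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": 2, "cores": 2, "memory": "4g"},
        },
    }

manifest = spark_application(
    "streaming-job",                   # hypothetical app name
    "myrepo/spark-custom:3.3.0",       # custom-built Spark image, as above
    "local:///opt/app/stream_job.py",  # entry point inside the image
)
```

Applying the equivalent YAML with kubectl is what hands the job to the
operator; the restartPolicy is the piece that matters most for a stateful
streaming workload.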
 


Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is that it hosts your
Spark applications as containers in an automated fashion, so that one can
easily deploy as many instances of the application as required (autoscaling).
From the following:

https://price2meet.com/gcp/docs/dataproc_docs_concepts_configuring-clusters_autoscaling.pdf

Autoscaling does not support Spark Structured Streaming
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html);
see the "Autoscaling and Spark Structured Streaming" section there.

By the same token, k8s is (as of now) more suitable for batch jobs than for
Spark Structured Streaming.
https://issues.apache.org/jira/browse/SPARK-12133
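
For reference, the non-operator path discussed in this thread is a plain
spark-submit against a k8s:// master URL. The sketch below assembles such a
command; the API server URL, image, and application path are hypothetical
placeholders, and whether a given streaming workload runs well this way is
exactly the open question above.

```python
# Sketch: submitting a Spark app directly to Kubernetes with spark-submit
# (the non-operator route). API server URL, image, and app path are
# hypothetical placeholders.
def k8s_submit_command(api_server, image, app_file):
    """Build the spark-submit argv for cluster-mode submission to k8s."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes API server endpoint
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", "spark.executor.instances=2",
        "--conf",
        "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
        app_file,
    ]

cmd = k8s_submit_command(
    "https://k8s-apiserver:6443",      # hypothetical cluster endpoint
    "myrepo/spark-custom:3.3.0",       # custom-built Spark image
    "local:///opt/app/stream_job.py",  # app baked into the image
)
```

The driver service-account conf is needed so the driver pod may create
executor pods; everything else is the same as any spark-submit.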


Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this
"In so far as I know k8s does not support spark structured streaming?"

So it is an open question; I just recalled it. I have not tested it myself. I
know structured streaming works on a Google Dataproc cluster, but I have not
seen any official link that says Spark Structured Streaming is supported on
k8s.

HTH

>>>>
>>>> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
>>>> wrote:
>>>>
>>>>> Hi Spark Team,
>>>>>
>>>>> We need to read/write Kinesis streams using Spark streaming.
>>>>>
>>>>> We checked the official documentation -
>>>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>>
>>>>> It does not mention a Kinesis connector. An alternative is
>>>>> https://github.com/qubole/kinesis-sql, which is no longer active. It
>>>>> has been handed over here:
>>>>> https://github.com/roncemer/spark-sql-kinesis
>>>>>
>>>>> Also, according to SPARK-18165
>>>>> <https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially
>>>>> does not have any Kinesis connector.
>>>>>
>>>>> We have a few questions below; it would be great if you could answer:
>>>>>
>>>>>    1. Does Spark officially provide any Kinesis connector with
>>>>>    readStream/writeStream support, and does it endorse any connector
>>>>>    for production use cases?
>>>>>    2.
>>>>>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>>    does not mention how to write to Kinesis. This method uses DynamoDB
>>>>>    as the default checkpoint store; can we override it?
>>>>>    3. We have RocksDB as a state store, but when we ran an application
>>>>>    using the official
>>>>>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html,
>>>>>    the RocksDB configurations were not effective. Can you please
>>>>>    confirm whether RocksDB is not applicable in these cases?
>>>>>    4. RocksDB however works with the qubole connector; do you have any
>>>>>    plan to release a Kinesis connector?
>>>>>    5. Please recommend any good, stable Kinesis connector or some
>>>>>    pointers around it.
>>>>>
>>>>>
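
On questions 1 and 2: to the best of my knowledge Spark still ships no
Kinesis source for Structured Streaming; the community kinesis-sql forks
register a "kinesis" source instead. Below is a rough sketch of how a read
would look with that connector. The option names are as best recalled from
the qubole/roncemer README and should be verified against the fork you
actually build; the stream name and region are hypothetical.

```python
# Sketch: reading a Kinesis stream with the community kinesis-sql connector
# (spark.readStream.format("kinesis")). Option names follow the
# qubole/roncemer README as best recalled -- verify against the fork you
# build. Stream name and region are hypothetical.
def kinesis_reader_options(stream_name, region):
    """Options for the community 'kinesis' Structured Streaming source."""
    return {
        "streamName": stream_name,
        "endpointUrl": f"https://kinesis.{region}.amazonaws.com",
        # Credentials: pass awsAccessKeyId/awsSecretKey options here, or
        # rely on the instance profile / default credential chain.
        "startingposition": "TRIM_HORIZON",  # or LATEST / EARLIEST
    }

opts = kinesis_reader_options("my-events", "us-east-1")

def run(options):
    # Requires pyspark plus the connector jar on the classpath. Defined but
    # not invoked here; it shows the shape of the job, not a runnable demo.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("kinesis-read").getOrCreate()
    df = spark.readStream.format("kinesis").options(**options).load()
    # The connector exposes the record payload as a binary `data` column.
    query = (df.selectExpr("CAST(data AS STRING) AS payload")
               .writeStream.format("console").start())
    query.awaitTermination()
```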

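On question 3, one likely explanation: the RocksDB state store is a
Structured Streaming (Spark SQL) feature, while the
streaming-kinesis-integration page documents the legacy DStream receiver,
which does its own checkpointing and never touches the SQL state store, so
the RocksDB settings would have no effect there. For a Structured Streaming
query the provider is enabled roughly as below; this is a sketch, with the
provider class standard since Spark 3.2 and the tuning key optional.

```python
# Sketch: enabling the RocksDB state store provider for a Structured
# Streaming query (Spark 3.2+). Shown as a config dict so the same keys can
# be passed as --conf flags. These apply to Structured Streaming only, not
# to the legacy DStream Kinesis receiver.
ROCKSDB_CONFS = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state."
        "RocksDBStateStoreProvider",
    # Optional tuning knob; see your Spark version's docs for the full set.
    "spark.sql.streaming.stateStore.rocksdb.compactOnCommit": "false",
}

def build_session(confs):
    # Requires pyspark; defined but not invoked here.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("rocksdb-state")
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```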

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support
spark streaming?

On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, 
wrote:

> Do you have a high level diagram of the proposed solution?
>
> In so far as I know k8s does not support spark structured streaming?
>
>
> On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar 
> wrote:
>
>> Our use case is that we want to read/write to Kinesis streams using k8s.
>> Officially, I could not find a connector or reader for Kinesis in Spark
>> like the one it has for Kafka.
>>
>> Checking here whether anyone has used the Kinesis and Spark streaming
>> combination?
>>
>> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
>> wrote:
>>
>>> Hi Rajesh,
>>>
>>> What is the use case for Kinesis here? I have not used it personally.
>>> Which use case does it concern?
>>>
>>> https://aws.amazon.com/kinesis/
>>>
>>> Can you use something else instead?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
>>> wrote:
>>>
>>>> Hi Spark Team,
>>>>
>>>> We need to read/write the kinesis streams using spark streaming.
>>>>
>>>>  We checked the official documentation -
>>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>
>>>> It does not mention kinesis connector. Alternative is -
>>>> https://github.com/qubole/kinesis-sql which is not active now.  This
>>>> is now handed over here - https://github.com/roncemer/spark-sql-kinesis
>>>>
>>>> Also, according to SPARK-18165
>>>> <https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially
>>>> does not have any Kinesis connector.
>>>>
>>>> We have a few questions below; it would be great if you could answer them:
>>>>
>>>>    1. Does Spark officially provide any Kinesis connector with
>>>>    readStream/writeStream support, and does it endorse any connector for
>>>>    production use cases?
>>>>    2. https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>    does not mention how to write to Kinesis. This method uses DynamoDB as
>>>>    the default checkpoint store; can we override it?
>>>>    3. We use RocksDB as a state store, but when we ran an application using
>>>>    the official
>>>>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>    approach, the RocksDB configurations were not effective. Can you please
>>>>    confirm whether RocksDB is not applicable in these cases?
>>>>    4. RocksDB does, however, work with the Qubole connector. Do you have
>>>>    any plan to release a Kinesis connector?
>>>>    5. Please recommend any good, stable Kinesis connector, or some
>>>>    pointers around it.
>>>>
>>>>


Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Our use case is that we want to read/write to Kinesis streams using k8s.
Officially, I could not find a connector or reader for Kinesis in Spark
like the one it has for Kafka.

Checking here whether anyone has used the Kinesis and Spark streaming
combination?
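For anyone searching later: with the community connector mentioned below (qubole/kinesis-sql or its roncemer fork), a read looks roughly like the sketch here. The option names follow that connector's convention and the stream name, endpoint and paths are placeholders, so treat the whole block as an assumption to verify against the connector's README:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kinesis-read-sketch").getOrCreate()

// Source options as used by the qubole/kinesis-sql connector (assumed).
val df = spark.readStream
  .format("kinesis")
  .option("streamName", "my-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("startingposition", "LATEST")
  .load()

// Write to the console sink for inspection; a real job would pick a
// durable sink and a checkpoint location on shared storage.
df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kinesis-checkpoint")
  .start()
```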

On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
wrote:

> Hi Rajesh,
>
> What is the use case for Kinesis here? I have not used it personally.
> Which use case does it concern?
>
> https://aws.amazon.com/kinesis/
>
> Can you use something else instead?
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
> wrote:
>
>> Hi Spark Team,
>>
>> We need to read/write the kinesis streams using spark streaming.
>>
>>  We checked the official documentation -
>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>
>> It does not mention kinesis connector. Alternative is -
>> https://github.com/qubole/kinesis-sql which is not active now.  This is
>> now handed over here - https://github.com/roncemer/spark-sql-kinesis
>>
>> Also, according to SPARK-18165
>> <https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially
>> does not have any Kinesis connector.
>>
>> We have a few questions below; it would be great if you could answer them:
>>
>>    1. Does Spark officially provide any Kinesis connector with
>>    readStream/writeStream support, and does it endorse any connector for
>>    production use cases?
>>    2. https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>    does not mention how to write to Kinesis. This method uses DynamoDB as
>>    the default checkpoint store; can we override it?
>>    3. We use RocksDB as a state store, but when we ran an application using
>>    the official
>>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>    approach, the RocksDB configurations were not effective. Can you please
>>    confirm whether RocksDB is not applicable in these cases?
>>    4. RocksDB does, however, work with the Qubole connector. Do you have
>>    any plan to release a Kinesis connector?
>>    5. Please recommend any good, stable Kinesis connector, or some
>>    pointers around it.
>>
>>


RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
unsubscribe

Regards,
Kurt Jonske
Senior Director
Alvarez & Marsal
Direct:  212 328 8532
Mobile:  312 560 5040
Email:  kjon...@alvarezandmarsal.com
www.alvarezandmarsal.com

From: Mich Talebzadeh 
Sent: Thursday, April 06, 2023 11:45 AM
To: Rajesh Katkar 
Cc: u...@spark.incubator.apache.org
Subject: Re: spark streaming and kinesis integration



Do you have a high level diagram of the proposed solution?

In so far as I know k8s does not support spark structured streaming?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


 
   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar 
mailto:katkar.raj...@gmail.com>> wrote:
Our use case is that we want to read/write to Kinesis streams using k8s.
Officially, I could not find a connector or reader for Kinesis in Spark
like the one it has for Kafka.

Checking here whether anyone has used the Kinesis and Spark streaming
combination?

On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
mailto:mich.talebza...@gmail.com>> wrote:
Hi Rajesh,

What is the use case for Kinesis here? I have not used it personally.
Which use case does it concern?

https://aws.amazon.com/kinesis/

Can you use something else instead?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


 
   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
mailto:katkar.raj...@gmail.com>> wrote:

Hi Spark Team,

We need to read/write the kinesis streams using spark streaming.

 We checked the official documentation -
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html

It does not mention a Kinesis connector. The alternative is
https://github.com/qubole/kinesis-sql, which is not active now. It has now
been handed over here - https://github.com/roncemer/spark-sql-kinesis

Also, according to SPARK-18165
<https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially does
not have any Kinesis connector.

We have a few questions below; it would be great if you could answer them:

  1.  Does Spark officially provide any Kinesis connector with
readStream/writeStream support, and does it endorse any connector for
production use cases?
  2.  https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
does not mention how to write to Kinesis. This method uses DynamoDB as the
default checkpoint store; can we override it?
  3.  We use RocksDB as a state store, but when we ran an application using
the official
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
approach, the RocksDB configurations were not effective. Can you please
confirm whether RocksDB is not applicable in these cases?
  4.  RocksDB does, however, work with the Qubole connector. Do you have any
plan to release a Kinesis connector?
  5.  Please recommend any good, stable Kinesis connector, or some pointers
around it.
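On question 2: with the built-in DStream receiver (spark-streaming-kinesis-asl), the DynamoDB lease/checkpoint table cannot be swapped for another store, but its name is simply the checkpoint application name you pass to the builder. A minimal sketch, assuming an existing `ssc` StreamingContext; stream name, region and app name are placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Milliseconds
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val stream = KinesisInputDStream.builder
  .streamingContext(ssc)                        // assumed to exist already
  .streamName("my-stream")                      // placeholder
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .initialPosition(new KinesisInitialPositions.Latest())
  .checkpointAppName("my-app")                  // becomes the DynamoDB table name
  .checkpointInterval(Milliseconds(1000))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
```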


Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution?

In so far as I know k8s does not support spark structured streaming?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar  wrote:

> Our use case is that we want to read/write to Kinesis streams using k8s.
> Officially, I could not find a connector or reader for Kinesis in Spark
> like the one it has for Kafka.
>
> Checking here whether anyone has used the Kinesis and Spark streaming
> combination?
>
> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
> wrote:
>
>> Hi Rajesh,
>>
>> What is the use case for Kinesis here? I have not used it personally.
>> Which use case does it concern?
>>
>> https://aws.amazon.com/kinesis/
>>
>> Can you use something else instead?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
>> wrote:
>>
>>> Hi Spark Team,
>>>
>>> We need to read/write the kinesis streams using spark streaming.
>>>
>>>  We checked the official documentation -
>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>
>>> It does not mention kinesis connector. Alternative is -
>>> https://github.com/qubole/kinesis-sql which is not active now.  This is
>>> now handed over here - https://github.com/roncemer/spark-sql-kinesis
>>>
>>> Also, according to SPARK-18165
>>> <https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially
>>> does not have any Kinesis connector.
>>>
>>> We have a few questions below; it would be great if you could answer them:
>>>
>>>    1. Does Spark officially provide any Kinesis connector with
>>>    readStream/writeStream support, and does it endorse any connector for
>>>    production use cases?
>>>    2. https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>    does not mention how to write to Kinesis. This method uses DynamoDB as
>>>    the default checkpoint store; can we override it?
>>>    3. We use RocksDB as a state store, but when we ran an application using
>>>    the official
>>>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>    approach, the RocksDB configurations were not effective. Can you please
>>>    confirm whether RocksDB is not applicable in these cases?
>>>    4. RocksDB does, however, work with the Qubole connector. Do you have
>>>    any plan to release a Kinesis connector?
>>>    5. Please recommend any good, stable Kinesis connector, or some
>>>    pointers around it.
>>>
>>>


Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh,

What is the use case for Kinesis here? I have not used it personally.
Which use case does it concern?

https://aws.amazon.com/kinesis/

Can you use something else instead?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar  wrote:

> Hi Spark Team,
>
> We need to read/write the kinesis streams using spark streaming.
>
>  We checked the official documentation -
> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>
> It does not mention kinesis connector. Alternative is -
> https://github.com/qubole/kinesis-sql which is not active now.  This is
> now handed over here - https://github.com/roncemer/spark-sql-kinesis
>
> Also, according to SPARK-18165
> <https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially
> does not have any Kinesis connector.
>
> We have a few questions below; it would be great if you could answer them:
>
>    1. Does Spark officially provide any Kinesis connector with
>    readStream/writeStream support, and does it endorse any connector for
>    production use cases?
>    2. https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>    does not mention how to write to Kinesis. This method uses DynamoDB as
>    the default checkpoint store; can we override it?
>    3. We use RocksDB as a state store, but when we ran an application using
>    the official
>    https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>    approach, the RocksDB configurations were not effective. Can you please
>    confirm whether RocksDB is not applicable in these cases?
>    4. RocksDB does, however, work with the Qubole connector. Do you have
>    any plan to release a Kinesis connector?
>    5. Please recommend any good, stable Kinesis connector, or some
>    pointers around it.
>
>


spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Hi Spark Team,

We need to read/write the kinesis streams using spark streaming.

 We checked the official documentation -
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html

It does not mention kinesis connector. Alternative is -
https://github.com/qubole/kinesis-sql which is not active now.  This is now
handed over here - https://github.com/roncemer/spark-sql-kinesis

Also, according to SPARK-18165
<https://issues.apache.org/jira/browse/SPARK-18165>, Spark officially does
not have any Kinesis connector.

We have a few questions below; it would be great if you could answer them:

   1. Does Spark officially provide any Kinesis connector with
   readStream/writeStream support, and does it endorse any connector for
   production use cases?
   2. https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
   does not mention how to write to Kinesis. This method uses DynamoDB as
   the default checkpoint store; can we override it?
   3. We use RocksDB as a state store, but when we ran an application using
   the official
   https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
   approach, the RocksDB configurations were not effective. Can you please
   confirm whether RocksDB is not applicable in these cases?
   4. RocksDB does, however, work with the Qubole connector. Do you have
   any plan to release a Kinesis connector?
   5. Please recommend any good, stable Kinesis connector, or some
   pointers around it.
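On question 3, one likely explanation (an assumption, not something confirmed in this thread): the RocksDB state store only applies to Structured Streaming, while the kinesis-asl receiver from the linked documentation uses the old DStream API, which does not go through the SQL state store at all. For a Structured Streaming query, the provider is enabled like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  // Available in Spark 3.2+: replace the default HDFS-backed state store
  // with the RocksDB-based provider used by stateful streaming queries.
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```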


Re: spark streaming with kinesis

2016-11-20 Thread Takeshi Yamamuro
"1 userid data" is ambiguous though (user-input data? stream? shard?).
Since a Kinesis worker fetches data from the shards that it has ownership
of, IIUC the user-input data in a shard are transferred to its assigned
worker as long as there is no failure.

// maropu

On Mon, Nov 21, 2016 at 1:59 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Hi
>
> Thanks.
> I have a doubt about the Spark streaming Kinesis consumer. Say I have a
> batch time of 500 ms and the Kinesis stream is partitioned on userid
> (uniformly distributed). Since IdleTimeBetweenReadsInMillis is set to
> 1000 ms, the Spark receiver nodes will fetch the data at an interval of 1
> second and store it in the InputDStream.
>
> 1. When the worker executors fetch the data from the receiver every 500
> ms, is it guaranteed that one userid's data will always go to one
> partition, and to one worker only?
> 2. If not, can I repartition the stream data before processing? If yes,
> how? JavaDStream has only one repartition method, which takes the number
> of partitions and not a partitioner function, so it will randomly
> repartition the DStream data.
>
> Thanks
>
>
>
>
>
>
>
>
> On Tue, Nov 15, 2016 at 8:23 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> It seems it is not a good design to frequently restart workers within a
>> minute, because their initialization and shutdown take much time, as you
>> said (e.g., interconnection overheads with DynamoDB and graceful
>> shutdown).
>>
>> Anyway, since this is a kind of question about the AWS Kinesis library,
>> you'd better ask the AWS folks in their forum or something.
>>
>> // maropu
>>
>>
>> On Mon, Nov 14, 2016 at 11:20 PM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> 1. No, I want to implement a low-level consumer on the Kinesis stream,
>>> so I need to stop the worker once it has read the latest sequence
>>> number sent by the driver.
>>>
>>> 2. What is the cost of frequently registering and deregistering a
>>> worker node? Is it that when the worker's shutdown is called it will
>>> terminate the run method, but the lease coordinator will wait 2 seconds
>>> before releasing the lease, so I cannot deregister a worker in less
>>> than 2 seconds?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin@gmail.com
>>> > wrote:
>>>
>>>> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not
>>>> enough for your usecase?
>>>>
>>>> On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>>
>>>>> Thanks!
>>>>> Is there a way to get the latest sequence number of all shards of a
>>>>> kinesis stream?
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <
>>>>> linguin@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>>>>>> in KinesisClientLibConfiguration though,
>>>>>> it is not configurable in the current implementation.
>>>>>>
>>>>>> The detail can be found in;
>>>>>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>>>>>
>>>>>> // maropu
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>>>>>> shushantaror...@gmail.com> wrote:
>>>>>>
>>>>>>> *Hi*
>>>>>>>
>>>>>>> Is *spark.streaming.blockInterval* for the Kinesis input stream
>>>>>>> hardcoded to 1 sec, or is it configurable? That is, the time
>>>>>>> interval at which the receiver fetches data from Kinesis.
>>>>>>>
>>>>>>> This means the stream batch interval cannot be less than
>>>>>>> *spark.streaming.blockInterval*, and this should be configurable.
>>>>>>> Also, is there any minimum value for the streaming batch interval?
>>>>>>>
>>>>>>> *Thanks*
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: spark streaming with kinesis

2016-11-20 Thread Shushant Arora
Hi

Thanks.
I have a doubt about the Spark streaming Kinesis consumer. Say I have a
batch time of 500 ms and the Kinesis stream is partitioned on userid
(uniformly distributed). Since IdleTimeBetweenReadsInMillis is set to
1000 ms, the Spark receiver nodes will fetch the data at an interval of 1
second and store it in the InputDStream.

1. When the worker executors fetch the data from the receiver every 500
ms, is it guaranteed that one userid's data will always go to one
partition, and to one worker only?
2. If not, can I repartition the stream data before processing? If yes,
how? JavaDStream has only one repartition method, which takes the number
of partitions and not a partitioner function, so it will randomly
repartition the DStream data.

Thanks
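On (2), a common workaround (a sketch under assumptions, not something confirmed in this thread): key the records yourself and use transform() with an explicit partitioner, since repartition(n) alone shuffles randomly. `extractUserId` is a hypothetical helper for whatever your record payload looks like:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Hypothetical: parse the userid out of a raw Kinesis record payload.
def extractUserId(record: Array[Byte]): String =
  new String(record, "UTF-8").split(",")(0)

def partitionByUser(stream: DStream[Array[Byte]],
                    numPartitions: Int): DStream[(String, Array[Byte])] =
  stream.transform { rdd =>
    // Pair each record with its userid, then apply a deterministic
    // partitioner so one userid always lands in one partition per batch.
    rdd.map(rec => (extractUserId(rec), rec))
       .partitionBy(new HashPartitioner(numPartitions))
  }
```

Note this pins a userid to a partition within each batch, but it does not pin partitions to a specific worker across batches.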








On Tue, Nov 15, 2016 at 8:23 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> It seems it is not a good design to frequently restart workers within a
> minute, because their initialization and shutdown take much time, as you
> said (e.g., interconnection overheads with DynamoDB and graceful
> shutdown).
>
> Anyway, since this is a kind of question about the AWS Kinesis library,
> you'd better ask the AWS folks in their forum or something.
>
> // maropu
>
>
> On Mon, Nov 14, 2016 at 11:20 PM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> 1. No, I want to implement a low-level consumer on the Kinesis stream,
>> so I need to stop the worker once it has read the latest sequence number
>> sent by the driver.
>>
>> 2. What is the cost of frequently registering and deregistering a worker
>> node? Is it that when the worker's shutdown is called it will terminate
>> the run method, but the lease coordinator will wait 2 seconds before
>> releasing the lease, so I cannot deregister a worker in less than 2
>> seconds?
>>
>> Thanks!
>>
>>
>>
>> On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not
>>> enough for your usecase?
>>>
>>> On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> Thanks!
>>>> Is there a way to get the latest sequence number of all shards of a
>>>> kinesis stream?
>>>>
>>>>
>>>>
>>>> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <
>>>> linguin@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>>>>> in KinesisClientLibConfiguration though,
>>>>> it is not configurable in the current implementation.
>>>>>
>>>>> The detail can be found in;
>>>>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>>>>
>>>>> // maropu
>>>>>
>>>>>
>>>>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>>>>> shushantaror...@gmail.com> wrote:
>>>>>
>>>>>> *Hi*
>>>>>>
>>>>>> Is *spark.streaming.blockInterval* for the Kinesis input stream
>>>>>> hardcoded to 1 sec, or is it configurable? That is, the time
>>>>>> interval at which the receiver fetches data from Kinesis.
>>>>>>
>>>>>> This means the stream batch interval cannot be less than
>>>>>> *spark.streaming.blockInterval*, and this should be configurable.
>>>>>> Also, is there any minimum value for the streaming batch interval?
>>>>>>
>>>>>> *Thanks*
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
It seems it is not a good design to frequently restart workers within a
minute, because their initialization and shutdown take much time, as you
said (e.g., interconnection overheads with DynamoDB and graceful
shutdown).

Anyway, since this is a kind of question about the AWS Kinesis library,
you'd better ask the AWS folks in their forum or something.

// maropu


On Mon, Nov 14, 2016 at 11:20 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> 1. No, I want to implement a low-level consumer on the Kinesis stream,
> so I need to stop the worker once it has read the latest sequence number
> sent by the driver.
>
> 2. What is the cost of frequently registering and deregistering a worker
> node? Is it that when the worker's shutdown is called it will terminate
> the run method, but the lease coordinator will wait 2 seconds before
> releasing the lease, so I cannot deregister a worker in less than 2
> seconds?
>
> Thanks!
>
>
>
> On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not
>> enough for your usecase?
>>
>> On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Thanks!
>>> Is there a way to get the latest sequence number of all shards of a
>>> kinesis stream?
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>>>> in KinesisClientLibConfiguration though,
>>>> it is not configurable in the current implementation.
>>>>
>>>> The detail can be found in;
>>>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>>>
>>>> // maropu
>>>>
>>>>
>>>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>>
>>>>> *Hi*
>>>>>
>>>>> Is *spark.streaming.blockInterval* for the Kinesis input stream
>>>>> hardcoded to 1 sec, or is it configurable? That is, the time interval
>>>>> at which the receiver fetches data from Kinesis.
>>>>>
>>>>> This means the stream batch interval cannot be less than
>>>>> *spark.streaming.blockInterval*, and this should be configurable.
>>>>> Also, is there any minimum value for the streaming batch interval?
>>>>>
>>>>> *Thanks*
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
1. No, I want to implement a low-level consumer on the Kinesis stream, so
I need to stop the worker once it has read the latest sequence number sent
by the driver.

2. What is the cost of frequently registering and deregistering a worker
node? Is it that when the worker's shutdown is called it will terminate
the run method, but the lease coordinator will wait 2 seconds before
releasing the lease, so I cannot deregister a worker in less than 2
seconds?

Thanks!



On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not
> enough for your usecase?
>
> On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> Thanks!
>> Is there a way to get the latest sequence number of all shards of a
>> kinesis stream?
>>
>>
>>
>> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>>> in KinesisClientLibConfiguration though,
>>> it is not configurable in the current implementation.
>>>
>>> The detail can be found in;
>>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>>
>>> // maropu
>>>
>>>
>>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> *Hi*
>>>>
>>>> Is *spark.streaming.blockInterval* for the Kinesis input stream
>>>> hardcoded to 1 sec, or is it configurable? That is, the time interval
>>>> at which the receiver fetches data from Kinesis.
>>>>
>>>> This means the stream batch interval cannot be less than
>>>> *spark.streaming.blockInterval*, and this should be configurable.
>>>> Also, is there any minimum value for the streaming batch interval?
>>>>
>>>> *Thanks*
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not enough
for your usecase?
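For context, getting the latest sequence numbers per shard can be sketched with the AWS CLI (the stream name is a placeholder; `list-shards` returns a SequenceNumberRange for each shard):

```shell
# Enumerate shards; each entry includes a SequenceNumberRange.
aws kinesis list-shards --stream-name my-stream

# Get a LATEST iterator for one shard (shard id copied from the output above).
aws kinesis get-shard-iterator \
  --stream-name my-stream \
  --shard-id shardId-000000000000 \
  --shard-iterator-type LATEST
```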

On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks!
> Is there a way to get the latest sequence number of all shards of a
> kinesis stream?
>
>
>
> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>> in KinesisClientLibConfiguration though,
>> it is not configurable in the current implementation.
>>
>> The detail can be found in;
>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>
>> // maropu
>>
>>
>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> *Hi*
>>>
>>> Is *spark.streaming.blockInterval* for the Kinesis input stream
>>> hardcoded to 1 sec, or is it configurable? That is, the time interval
>>> at which the receiver fetches data from Kinesis.
>>>
>>> This means the stream batch interval cannot be less than
>>> *spark.streaming.blockInterval*, and this should be configurable.
>>> Also, is there any minimum value for the streaming batch interval?
>>>
>>> *Thanks*
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
Thanks!
Is there a way to get the latest sequence number of all shards of a kinesis
stream?



On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> The time interval can be controlled by `IdleTimeBetweenReadsInMillis` in 
> KinesisClientLibConfiguration
> though,
> it is not configurable in the current implementation.
>
> The detail can be found in;
> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>
> // maropu
>
>
> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> Hi
>>
>> Is *spark.streaming.blockInterval* for a Kinesis input stream hardcoded
>> to 1 sec, or is it configurable? That is the time interval at which the
>> receiver fetches data from Kinesis.
>>
>> This means the stream batch interval cannot be less than
>> *spark.streaming.blockInterval*, and this should be configurable. Also,
>> is there any minimum value for the streaming batch interval?
>>
>> *Thanks*
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Hi,

The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
in KinesisClientLibConfiguration; however, it is not configurable in the
current implementation.

The details can be found at:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152

// maropu
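For reference, in a standalone KCL 1.x application (outside Spark) that knob is set on `KinesisClientLibConfiguration` directly. A sketch only; the application/stream/worker names and the 500 ms value are assumptions, and Spark's `KinesisReceiver` builds this object internally without exposing the setting:

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// Hypothetical names; this is the KCL 1.x class referenced above.
val kclConfig = new KinesisClientLibConfiguration(
    "my-app",                                  // application (DynamoDB table) name
    "my-stream",                               // Kinesis stream name
    new DefaultAWSCredentialsProviderChain(),  // credentials
    "worker-1")                                // worker id
  .withIdleTimeBetweenReadsInMillis(500L)      // poll every 500 ms instead of the 1 s default
```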


On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Hi
>
> Is *spark.streaming.blockInterval* for a Kinesis input stream hardcoded
> to 1 sec, or is it configurable? That is the time interval at which the
> receiver fetches data from Kinesis.
>
> This means the stream batch interval cannot be less than
> *spark.streaming.blockInterval*, and this should be configurable. Also,
> is there any minimum value for the streaming batch interval?
>
> *Thanks*
>
>


-- 
---
Takeshi Yamamuro


spark streaming with kinesis

2016-11-12 Thread Shushant Arora
Hi

Is *spark.streaming.blockInterval* for a Kinesis input stream hardcoded
to 1 sec, or is it configurable? That is the time interval at which the
receiver fetches data from Kinesis.

This means the stream batch interval cannot be less than
*spark.streaming.blockInterval*, and this should be configurable. Also, is
there any minimum value for the streaming batch interval?

*Thanks*
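Note that `spark.streaming.blockInterval` itself is an ordinary Spark setting (default 200 ms) and is configurable; the fixed 1 s interval the thread discusses is the receiver's Kinesis fetch interval, which, as noted elsewhere in the thread, is not. A minimal sketch, with the app name, 500 ms block interval, and 2 s batch interval as assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: blockInterval controls how often received data is chunked into
// blocks (and hence the number of tasks per batch); it must not exceed the
// batch interval.
val conf = new SparkConf()
  .setAppName("kinesis-block-interval")          // hypothetical app name
  .set("spark.streaming.blockInterval", "500ms") // assumption: 500 ms blocks
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval >= block interval
```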


Re: spark streaming with kinesis

2016-11-07 Thread Takeshi Yamamuro
I'm not familiar with the kafka implementation though, a kinesis receiver
runs in a thread of executors.
You can set any value in the interval, but frequent checkpoints cause
excess loads in dynamodb.
See:
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html#kinesis-checkpointing

// maropu

On Mon, Nov 7, 2016 at 1:36 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Hi
>
> By receiver I meant the Spark Streaming receiver architecture - meaning
> worker nodes are different from receiver nodes. There is no direct/low-level
> consumer for Kinesis in Spark Streaming like there is for Kafka?
>
> Is there any limitation on the checkpoint interval - a minimum of 1 second
> in Spark Streaming with Kinesis? Though as such there is no limit on the
> checkpoint interval on the KCL side?
>
> Thanks
>
> On Tue, Oct 25, 2016 at 8:36 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> I'm not exactly sure about the receiver you pointed though,
>> if you point the "KinesisReceiver" implementation, yes.
>>
>> Also, we currently cannot disable the interval checkpoints.
>>
>> On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Thanks!
>>>
>>> Are Kinesis streams receiver-based only? Is there a non-receiver-based
>>> consumer for Kinesis ?
>>>
>>> And Instead of having fixed checkpoint interval,Can I disable auto
>>> checkpoint and say  when my worker has processed the data after last record
>>> of mapPartition now checkpoint the sequence no using some api.
>>>
>>>
>>>
>>> On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> The only thing you can do for Kinesis checkpoints is tune the interval
>>>> of them.
>>>> https://github.com/apache/spark/blob/master/external/kinesis
>>>> -asl/src/main/scala/org/apache/spark/streaming/kinesis/Kines
>>>> isUtils.scala#L68
>>>>
>>>> Whether the dataloss occurs or not depends on the storage level you set;
>>>> if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue
>>>> processing
>>>> in case of the dataloss because the stream data Spark receives are
>>>> replicated across executors.
>>>> However, if all the executors that have the replicated data crash,
>>>> IIUC the data loss occurs.
>>>>
>>>> // maropu
>>>>
>>>> On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>>
>>>>> Does spark streaming consumer for kinesis uses Kinesis Client Library
>>>>>  and mandates to checkpoint the sequence number of shards in dynamo db.
>>>>>
>>>>> Will it lead to data loss if consumed data records are not yet processed
>>>>> and Kinesis checkpointed the consumed sequence numbers in DynamoDB and
>>>>> the Spark worker crashes - then Spark launches the worker on another node
>>>>> but starts consuming from DynamoDB's checkpointed sequence number which
>>>>> is ahead of the processed sequence number?
>>>>>
>>>>> Is there a way to checkpoint the sequence numbers ourselves in
>>>>> Kinesis as it is in the Kafka low-level consumer?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
Hi

By receiver I meant the Spark Streaming receiver architecture - meaning
worker nodes are different from receiver nodes. There is no direct/low-level
consumer for Kinesis in Spark Streaming like there is for Kafka?

Is there any limitation on the checkpoint interval - a minimum of 1 second
in Spark Streaming with Kinesis? Though as such there is no limit on the
checkpoint interval on the KCL side?

Thanks

On Tue, Oct 25, 2016 at 8:36 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> I'm not exactly sure about the receiver you pointed though,
> if you point the "KinesisReceiver" implementation, yes.
>
> Also, we currently cannot disable the interval checkpoints.
>
> On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> Thanks!
>>
>> Are Kinesis streams receiver-based only? Is there a non-receiver-based
>> consumer for Kinesis ?
>>
>> And Instead of having fixed checkpoint interval,Can I disable auto
>> checkpoint and say  when my worker has processed the data after last record
>> of mapPartition now checkpoint the sequence no using some api.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The only thing you can do for Kinesis checkpoints is tune the interval
>>> of them.
>>> https://github.com/apache/spark/blob/master/external/kinesis
>>> -asl/src/main/scala/org/apache/spark/streaming/kinesis/Kines
>>> isUtils.scala#L68
>>>
>>> Whether the dataloss occurs or not depends on the storage level you set;
>>> if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue processing
>>> in case of the dataloss because the stream data Spark receives are
>>> replicated across executors.
>>> However, if all the executors that have the replicated data crash,
>>> IIUC the data loss occurs.
>>>
>>> // maropu
>>>
>>> On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> Does spark streaming consumer for kinesis uses Kinesis Client Library
>>>>  and mandates to checkpoint the sequence number of shards in dynamo db.
>>>>
>>>> Will it lead to data loss if consumed data records are not yet processed
>>>> and Kinesis checkpointed the consumed sequence numbers in DynamoDB and
>>>> the Spark worker crashes - then Spark launches the worker on another node
>>>> but starts consuming from DynamoDB's checkpointed sequence number which
>>>> is ahead of the processed sequence number?
>>>>
>>>> Is there a way to checkpoint the sequence numbers ourselves in Kinesis
>>>> as it is in the Kafka low-level consumer?
>>>>
>>>> Thanks
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Spark Streaming and Kinesis

2016-10-27 Thread Benjamin Kim
Has anyone worked with AWS Kinesis and retrieved data from it using Spark 
Streaming? I am having issues where it’s returning no data. I can connect to 
the Kinesis stream and describe using Spark. Is there something I’m missing? 
Are there specific IAM security settings needed? I just simply followed the 
Word Count ASL example. When it didn’t work, I even tried to run the code 
independently in Spark shell in yarn-client mode by hardcoding the arguments. 
Still, there was no data even with the setting InitialPositionInStream.LATEST 
changed to InitialPositionInStream.TRIM_HORIZON.
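On the IAM question: the Kinesis ASL receiver uses the KCL, which needs read access to the stream plus a DynamoDB table for checkpoints (and optionally CloudWatch for KCL metrics). A minimal policy sketch; the action names are standard AWS API actions, but all ARNs, account IDs, and names below are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:*"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-app"
    },
    {
      "Effect": "Allow",
      "Action": ["cloudwatch:PutMetricData"],
      "Resource": "*"
    }
  ]
}
```

The DynamoDB table name is the KCL application name passed to the stream; if the table does not exist, the KCL tries to create it, which is why broad DynamoDB access on that table helps when debugging "no data" symptoms.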

If anyone can help, I would truly appreciate it.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
I'm not exactly sure which receiver you are pointing at, but
if you mean the "KinesisReceiver" implementation, yes.

Also, we currently cannot disable the interval checkpoints.

On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks!
>
> Are Kinesis streams receiver-based only? Is there a non-receiver-based
> consumer for Kinesis ?
>
> And Instead of having fixed checkpoint interval,Can I disable auto
> checkpoint and say  when my worker has processed the data after last record
> of mapPartition now checkpoint the sequence no using some api.
>
>
>
> On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> The only thing you can do for Kinesis checkpoints is tune the interval of
>> them.
>> https://github.com/apache/spark/blob/master/external/kinesis
>> -asl/src/main/scala/org/apache/spark/streaming/kinesis/
>> KinesisUtils.scala#L68
>>
>> Whether the dataloss occurs or not depends on the storage level you set;
>> if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue processing
>> in case of the dataloss because the stream data Spark receives are
>> replicated across executors.
>> However, if all the executors that have the replicated data crash,
>> IIUC the data loss occurs.
>>
>> // maropu
>>
>> On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Does spark streaming consumer for kinesis uses Kinesis Client Library
>>>  and mandates to checkpoint the sequence number of shards in dynamo db.
>>>
>>> Will it lead to data loss if consumed data records are not yet processed
>>> and Kinesis checkpointed the consumed sequence numbers in DynamoDB and
>>> the Spark worker crashes - then Spark launches the worker on another node
>>> but starts consuming from DynamoDB's checkpointed sequence number which
>>> is ahead of the processed sequence number?
>>>
>>> Is there a way to checkpoint the sequence numbers ourselves in Kinesis
>>> as it is in the Kafka low-level consumer?
>>>
>>> Thanks
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Thanks!

Are Kinesis streams receiver-based only? Is there a non-receiver-based
consumer for Kinesis?

And instead of having a fixed checkpoint interval, can I disable auto
checkpointing and, when my worker has processed the data after the last
record of mapPartition, checkpoint the sequence number using some API?



On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> The only thing you can do for Kinesis checkpoints is tune the interval of
> them.
> https://github.com/apache/spark/blob/master/external/kinesis
> -asl/src/main/scala/org/apache/spark/streaming/kinesis
> /KinesisUtils.scala#L68
>
> Whether the dataloss occurs or not depends on the storage level you set;
> if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue processing
> in case of the dataloss because the stream data Spark receives are
> replicated across executors.
> However, if all the executors that have the replicated data crash,
> IIUC the data loss occurs.
>
> // maropu
>
> On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <shushantaror...@gmail.com
> > wrote:
>
>> Does spark streaming consumer for kinesis uses Kinesis Client Library
>>  and mandates to checkpoint the sequence number of shards in dynamo db.
>>
>> Will it lead to data loss if consumed data records are not yet processed
>> and Kinesis checkpointed the consumed sequence numbers in DynamoDB and
>> the Spark worker crashes - then Spark launches the worker on another node
>> but starts consuming from DynamoDB's checkpointed sequence number which
>> is ahead of the processed sequence number?
>>
>> Is there a way to checkpoint the sequence numbers ourselves in Kinesis
>> as it is in the Kafka low-level consumer?
>>
>> Thanks
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
Hi,

The only thing you can do for Kinesis checkpoints is tune their interval:
https://github.com/apache/spark/blob/master/external/
kinesis-asl/src/main/scala/org/apache/spark/streaming/
kinesis/KinesisUtils.scala#L68

Whether data loss occurs or not depends on the storage level you set;
if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue processing
in case of data loss because the stream data Spark receives is
replicated across executors.
However, if all the executors that have the replicated data crash,
IIUC data loss occurs.

// maropu
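Putting the two knobs above together — the DynamoDB checkpoint interval and the storage level — creating the receiver looks roughly like this. A sketch against the `spark-streaming-kinesis-asl` API of that era; the app/stream names, endpoint, region, and 10 s interval are all assumptions:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// Sketch only; names and values are placeholders.
def createKinesisStream(ssc: StreamingContext) =
  KinesisUtils.createStream(
    ssc,
    "my-app",                                  // KCL application / DynamoDB table name
    "my-stream",                               // Kinesis stream name
    "https://kinesis.eu-west-1.amazonaws.com", // endpoint URL
    "eu-west-1",                               // region
    InitialPositionInStream.LATEST,
    Seconds(10),                               // DynamoDB checkpoint interval (the only tunable)
    StorageLevel.MEMORY_AND_DISK_2)            // replicate blocks so losing one executor is survivable
```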

On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Does spark streaming consumer for kinesis uses Kinesis Client Library
>  and mandates to checkpoint the sequence number of shards in dynamo db.
>
> Will it lead to data loss if consumed data records are not yet processed and
> Kinesis checkpointed the consumed sequence numbers in DynamoDB and the Spark
> worker crashes - then Spark launches the worker on another node but starts
> consuming from DynamoDB's checkpointed sequence number which is ahead of
> the processed sequence number?
>
> Is there a way to checkpoint the sequence numbers ourselves in Kinesis as
> it is in the Kafka low-level consumer?
>
> Thanks
>
>


-- 
---
Takeshi Yamamuro


spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Does the Spark Streaming consumer for Kinesis use the Kinesis Client Library
and mandate checkpointing the sequence numbers of shards in DynamoDB?

Will it lead to data loss if consumed data records are not yet processed,
Kinesis checkpointed the consumed sequence numbers in DynamoDB, and the
Spark worker crashes - then Spark launches the worker on another node, but
it starts consuming from DynamoDB's checkpointed sequence number, which is
ahead of the processed sequence number?

Is there a way to checkpoint the sequence numbers ourselves in Kinesis, as
in the Kafka low-level consumer?

Thanks
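For contrast, a raw KCL 1.x application (outside Spark) does let the application decide when to checkpoint: the record processor is handed a checkpointer and invokes it explicitly after its own processing. A sketch against the KCL 1.x `IRecordProcessor` interface; the class name is hypothetical and error handling is omitted:

```scala
import java.util.{List => JList}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{
  IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
import com.amazonaws.services.kinesis.model.Record
import scala.collection.JavaConverters._

class ManualCheckpointProcessor extends IRecordProcessor {
  override def initialize(shardId: String): Unit = ()

  override def processRecords(records: JList[Record],
                              checkpointer: IRecordProcessorCheckpointer): Unit = {
    records.asScala.foreach(process) // process first ...
    checkpointer.checkpoint()        // ... then checkpoint, so a crash replays
  }                                  // unprocessed records instead of skipping them

  private def process(r: Record): Unit = { /* application logic */ }

  override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                        reason: ShutdownReason): Unit =
    if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
}
```

This at-least-once ordering (process, then checkpoint) is exactly what the question is after; Spark's receiver instead checkpoints on a timer, which is what opens the loss window described above.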


Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list

Ok, looks like when the KCL version was updated in
https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
probably leading to a dependency conflict, though as Burak mentions it's hard
to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
in driver or worker logs, so any exception is getting swallowed somewhere.

Run starting. Expected test count is: 4
KinesisStreamSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- KinesisUtils API
- RDD generation
- basic operation *** FAILED ***
  The code passed to eventually never returned normally. Attempted 13 times
over 2.04 minutes. Last failure message: Set() did not equal Set(5, 10,
1, 6, 9, 2, 7, 3, 8, 4)
  Data received does not match data sent. (KinesisStreamSuite.scala:188)
- failure recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 63 times
over 2.02863831 minutes. Last failure message: isCheckpointPresent
was true, but 0 was not greater than 10. (KinesisStreamSuite.scala:228)
Run completed in 5 minutes, 0 seconds.
Total number of tests run: 4
Suites: completed 1, aborted 0
Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
[INFO]

[INFO] BUILD FAILURE
[INFO]



KCL 1.3.0 depends on *1.9.37* SDK (
https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
while the Spark Kinesis dependency was kept at *1.9.16.*

I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
1.9.37 and everything works.

Run starting. Expected test count is: 28
KinesisBackedBlockRDDSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- Basic reading from Kinesis
- Read data available in both block manager and Kinesis
- Read data available only in block manager, not in Kinesis
- Read data available only in Kinesis, not in block manager
- Read data available partially in block manager, rest in Kinesis
- Test isBlockValid skips block fetching from block manager
- Test whether RDD is valid after removing blocks from block anager
KinesisStreamSuite:
- KinesisUtils API
- RDD generation
- basic operation
- failure recovery
KinesisReceiverSuite:
- check serializability of SerializableAWSCredentials
- process records including store and checkpoint
- shouldn't store and checkpoint when receiver is stopped
- shouldn't checkpoint when exception occurs during store
- should set checkpoint time to currentTime + checkpoint interval upon
instantiation
- should checkpoint if we have exceeded the checkpoint interval
- shouldn't checkpoint if we have not exceeded the checkpoint interval
- should add to time when advancing checkpoint
- shutdown should checkpoint if the reason is TERMINATE
- shutdown should not checkpoint if the reason is something other than
TERMINATE
- retry success on first attempt
- retry success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausing all retries
Run completed in 3 minutes, 28 seconds.
Total number of tests run: 28
Suites: completed 4, aborted 0
Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
file a JIRA for this?

@dev-list, since KCL brings in AWS SDK dependencies itself, is it necessary
to declare an explicit dependency on aws-java-sdk in the Kinesis POM? Also,
from KCL 1.5.0+, only the relevant components used from the AWS SDKs are
brought in, making things a bit leaner (this can be upgraded in Spark
1.7/2.0 perhaps). All local tests (and integration tests) pass with
removing the explicit dependency and only depending on KCL. Is aws-java-sdk
used anywhere else (AFAIK it is not, but in case I missed something let me
know any good reason to keep the explicit dependency)?
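One way to avoid this kind of skew is to keep the SDK pinned to the version the KCL was built against. A hedged pom fragment; the version numbers are the ones quoted above, not a recommendation beyond this thread:

```xml
<!-- Sketch: keep aws-java-sdk in lockstep with the KCL's own dependency
     (KCL 1.3.0 was built against SDK 1.9.37, while Spark pinned 1.9.16). -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>amazon-kinesis-client</artifactId>
  <version>1.3.0</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.9.37</version>
</dependency>
```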

N



On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Yeah also the integration tests need to be specifically run - I would have
> thought the contributor would have run those tests and also tested the
> change themselves using live Kinesis :(
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I don't think the Kinesis tests specifically ran when that was merged
>> into 1.5.2 :(
>> https://github.com/apache/spark/pull/8957
>>
>> https://github.com/apache/spark/commit/

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
Yes, it's against master: https://github.com/apache/spark/pull/10256

I'll push the KCL version bump after my local tests finish.

On Fri, Dec 11, 2015 at 10:42 AM Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Is that PR against master branch?
>
> S3 read comes from Hadoop / jet3t afaik
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Dec 11, 2015 at 5:38 PM, Brian London <brianmlon...@gmail.com>
> wrote:
>
>> That's good news  I've got a PR in to up the SDK version to 1.10.40 and
>> the KCL to 1.6.1 which I'm running tests on locally now.
>>
>> Is the AWS SDK not used for reading/writing from S3 or do we get that for
>> free from the Hadoop dependencies?
>>
>> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> cc'ing dev list
>>>
>>> Ok, looks like when the KCL version was updated in
>>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>>> probably leading to dependency conflict, though as Burak mentions its hard
>>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>>
>>> Run starting. Expected test count is: 4
>>> KinesisStreamSuite:
>>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>>> Kinesis streams for tests.
>>> - KinesisUtils API
>>> - RDD generation
>>> - basic operation *** FAILED ***
>>>   The code passed to eventually never returned normally. Attempted 13
>>> times over 2.04 minutes. Last failure message: Set() did not equal
>>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>>> - failure recovery *** FAILED ***
>>>   The code passed to eventually never returned normally. Attempted 63
>>> times over 2.02863831 minutes. Last failure message:
>>> isCheckpointPresent was true, but 0 was not greater than 10.
>>> (KinesisStreamSuite.scala:228)
>>> Run completed in 5 minutes, 0 seconds.
>>> Total number of tests run: 4
>>> Suites: completed 1, aborted 0
>>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>>> *** 2 TESTS FAILED ***
>>> [INFO]
>>> 
>>> [INFO] BUILD FAILURE
>>> [INFO]
>>> 
>>>
>>>
>>> KCL 1.3.0 depends on *1.9.37* SDK (
>>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>>
>>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS
>>> SDK 1.9.37 and everything works.
>>>
>>> Run starting. Expected test count is: 28
>>> KinesisBackedBlockRDDSuite:
>>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>>> Kinesis streams for tests.
>>> - Basic reading from Kinesis
>>> - Read data available in both block manager and Kinesis
>>> - Read data available only in block manager, not in Kinesis
>>> - Read data available only in Kinesis, not in block manager
>>> - Read data available partially in block manager, rest in Kinesis
>>> - Test isBlockValid skips block fetching from block manager
>>> - Test whether RDD is valid after removing blocks from block anager
>>> KinesisStreamSuite:
>>> - KinesisUtils API
>>> - RDD generation
>>> - basic operation
>>> - failure recovery
>>> KinesisReceiverSuite:
>>> - check serializability of SerializableAWSCredentials
>>> - process records including store and checkpoint
>>> - shouldn't store and checkpoint when receiver is stopped
>>> - shouldn't checkpoint when exception occurs during store
>>> - should set checkpoint time to currentTime + checkpoint interval upon
>>> instantiation
>>> - should checkpoint if we have exceeded the checkpoint interval
>>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>>> - should add to time when advancing checkpoint
>>> - shutdown should checkpoint if the reason is TERMINATE
>>> - shutdown should not checkpoint if the reason is something other than
>>> TERMINATE
>>> - retry success on

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
That's good news  I've got a PR in to up the SDK version to 1.10.40 and the
KCL to 1.6.1 which I'm running tests on locally now.

Is the AWS SDK not used for reading/writing from S3 or do we get that for
free from the Hadoop dependencies?

On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> cc'ing dev list
>
> Ok, looks like when the KCL version was updated in
> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
> probably leading to dependency conflict, though as Burak mentions its hard
> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
> in driver or worker logs, so any exception is getting swallowed somewhere.
>
> Run starting. Expected test count is: 4
> KinesisStreamSuite:
> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
> Kinesis streams for tests.
> - KinesisUtils API
> - RDD generation
> - basic operation *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 13
> times over 2.04 minutes. Last failure message: Set() did not equal
> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
> - failure recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 63
> times over 2.02863831 minutes. Last failure message:
> isCheckpointPresent was true, but 0 was not greater than 10.
> (KinesisStreamSuite.scala:228)
> Run completed in 5 minutes, 0 seconds.
> Total number of tests run: 4
> Suites: completed 1, aborted 0
> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
> *** 2 TESTS FAILED ***
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
>
>
> KCL 1.3.0 depends on *1.9.37* SDK (
> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
> while the Spark Kinesis dependency was kept at *1.9.16.*
>
> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
> 1.9.37 and everything works.
>
> Run starting. Expected test count is: 28
> KinesisBackedBlockRDDSuite:
> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
> Kinesis streams for tests.
> - Basic reading from Kinesis
> - Read data available in both block manager and Kinesis
> - Read data available only in block manager, not in Kinesis
> - Read data available only in Kinesis, not in block manager
> - Read data available partially in block manager, rest in Kinesis
> - Test isBlockValid skips block fetching from block manager
> - Test whether RDD is valid after removing blocks from block anager
> KinesisStreamSuite:
> - KinesisUtils API
> - RDD generation
> - basic operation
> - failure recovery
> KinesisReceiverSuite:
> - check serializability of SerializableAWSCredentials
> - process records including store and checkpoint
> - shouldn't store and checkpoint when receiver is stopped
> - shouldn't checkpoint when exception occurs during store
> - should set checkpoint time to currentTime + checkpoint interval upon
> instantiation
> - should checkpoint if we have exceeded the checkpoint interval
> - shouldn't checkpoint if we have not exceeded the checkpoint interval
> - should add to time when advancing checkpoint
> - shutdown should checkpoint if the reason is TERMINATE
> - shutdown should not checkpoint if the reason is something other than
> TERMINATE
> - retry success on first attempt
> - retry success on second attempt after a Kinesis throttling exception
> - retry success on second attempt after a Kinesis dependency exception
> - retry failed after a shutdown exception
> - retry failed after an invalid state exception
> - retry failed after unexpected exception
> - retry failed after exhausing all retries
> Run completed in 3 minutes, 28 seconds.
> Total number of tests run: 28
> Suites: completed 4, aborted 0
> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
>
> So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
> file a JIRA for this?
>
> @dev-list, since KCL brings in AWS SDK dependencies itself, is it
> necessary to declare an explicit dependency on aws-java-sdk in the Kinesis
> POM? Also, from KCL 1.5.0+, only the relevant components used from the AWS
> SDKs are brought in, making things a bit leaner (this can be upgraded in
> Spark 1.7/2.0 perhaps). All local tests (and integration tests) pass with
> removing the explicit dependency and only depending on KCL. Is aws-java-sdk
> used

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch?




S3 read comes from Hadoop / jet3t afaik



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 5:38 PM, Brian London <brianmlon...@gmail.com>
wrote:

> That's good news  I've got a PR in to up the SDK version to 1.10.40 and the
> KCL to 1.6.1 which I'm running tests on locally now.
> Is the AWS SDK not used for reading/writing from S3 or do we get that for
> free from the Hadoop dependencies?
> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>> cc'ing dev list
>>
>> Ok, looks like when the KCL version was updated in
>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>> probably leading to dependency conflict, though as Burak mentions its hard
>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>
>> Run starting. Expected test count is: 4
>> KinesisStreamSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - KinesisUtils API
>> - RDD generation
>> - basic operation *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 13
>> times over 2.04 minutes. Last failure message: Set() did not equal
>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>> - failure recovery *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 63
>> times over 2.02863831 minutes. Last failure message:
>> isCheckpointPresent was true, but 0 was not greater than 10.
>> (KinesisStreamSuite.scala:228)
>> Run completed in 5 minutes, 0 seconds.
>> Total number of tests run: 4
>> Suites: completed 1, aborted 0
>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>> *** 2 TESTS FAILED ***
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>>
>>
>> KCL 1.3.0 depends on *1.9.37* SDK (
>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>
>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
>> 1.9.37 and everything works.
>>
>> Run starting. Expected test count is: 28
>> KinesisBackedBlockRDDSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - Basic reading from Kinesis
>> - Read data available in both block manager and Kinesis
>> - Read data available only in block manager, not in Kinesis
>> - Read data available only in Kinesis, not in block manager
>> - Read data available partially in block manager, rest in Kinesis
>> - Test isBlockValid skips block fetching from block manager
>> - Test whether RDD is valid after removing blocks from block anager
>> KinesisStreamSuite:
>> - KinesisUtils API
>> - RDD generation
>> - basic operation
>> - failure recovery
>> KinesisReceiverSuite:
>> - check serializability of SerializableAWSCredentials
>> - process records including store and checkpoint
>> - shouldn't store and checkpoint when receiver is stopped
>> - shouldn't checkpoint when exception occurs during store
>> - should set checkpoint time to currentTime + checkpoint interval upon
>> instantiation
>> - should checkpoint if we have exceeded the checkpoint interval
>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>> - should add to time when advancing checkpoint
>> - shutdown should checkpoint if the reason is TERMINATE
>> - shutdown should not checkpoint if the reason is something other than
>> TERMINATE
>> - retry success on first attempt
>> - retry success on second attempt after a Kinesis throttling exception
>> - retry success on second attempt after a Kinesis dependency exception
>> - retry failed after a shutdown exception
>> - retry failed after an invalid state exception
>> - retry failed after unexpected exception
>> - retry failed after exhausting all retries
>> Run completed in 3 minutes, 28 seconds.
>> Total number of tests run: 28
>> Suites: completed 4, aborted 0
>> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
>> All tests passed.
>>
>> So this is a regression
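
For projects that hit the same mismatch in their own builds, one hedged workaround is to force the AWS SDK to the version the KCL expects, as described above. A hypothetical sbt fragment (artifact coordinates assumed, not taken from the Spark build itself):

```scala
// build.sbt sketch (assumption: sbt-based application build).
// Pin the AWS SDK to the version KCL 1.3.0 was built against,
// overriding any transitive 1.9.16 pulled in elsewhere.
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.9.37"
```

A Maven build would express the same idea with a `<dependencyManagement>` entry pinning the SDK version.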

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Nick's symptoms sound identical to mine.  I should mention that I just
pulled the latest version from github and it seems to be working there.  To
reproduce:


   1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
   2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
   clean package
   3. build/mvn -Pkinesis-asl -DskipTests clean package
   4. Then run simultaneously:
   1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
  [Kinesis stream name] [endpoint URL]
  2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis stream
  name] [endpoint URL] 100 10


On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
wrote:

> Hi Nick,
>
> Just to be sure: don't you see some ClassCastException in the log ?
>
> Thanks,
> Regards
> JB
>
> On 12/10/2015 07:56 PM, Nick Pentreath wrote:
> > Could you provide an example / test case and more detail on what issue
> > you're facing?
> >
> > I've just tested a simple program reading from a dev Kinesis stream and
> > using stream.print() to show the records, and it works under 1.5.1 but
> > doesn't appear to be working under 1.5.2.
> >
> > UI for 1.5.2:
> >
> > Inline image 1
> >
> > UI for 1.5.1:
> >
> > Inline image 2
> >
> > On Thu, Dec 10, 2015 at 5:50 PM, Brian London  > > wrote:
> >
> > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
> > Kinesis ASL that ships with 1.5.2 appears to not work for me
> > although 1.5.1 is fine. I spent some time with Amazon earlier in the
> > week and the only thing we could do to make it work is to change the
> > version to 1.5.1.  Can someone please attempt to reproduce before I
> > open a JIRA issue for it?
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Jean-Baptiste Onofré

Hi Nick,

Just to be sure: don't you see some ClassCastException in the log ?

Thanks,
Regards
JB

On 12/10/2015 07:56 PM, Nick Pentreath wrote:

Could you provide an example / test case and more detail on what issue
you're facing?

I've just tested a simple program reading from a dev Kinesis stream and
using stream.print() to show the records, and it works under 1.5.1 but
doesn't appear to be working under 1.5.2.

UI for 1.5.2:

Inline image 1

UI for 1.5.1:

Inline image 2

On Thu, Dec 10, 2015 at 5:50 PM, Brian London > wrote:

Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
Kinesis ASL that ships with 1.5.2 appears to not work for me
although 1.5.1 is fine. I spent some time with Amazon earlier in the
week and the only thing we could do to make it work is to change the
version to 1.5.1.  Can someone please attempt to reproduce before I
open a JIRA issue for it?




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Yes, it worked in the 1.6 branch as of commit
db5165246f2888537dd0f3d4c5a515875c7358ed.  That makes it much less serious
of an issue, although it would be nice to know what the root cause is to
avoid a regression.

On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz  wrote:

> I've noticed this happening when there was some dependency conflicts, and
> it is super hard to debug.
> It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0,
> but it is 1.2.1 in Spark 1.5.1.
> I feel like that seems to be the problem...
>
> Brian, did you verify that it works with the 1.6.0 branch?
>
> Thanks,
> Burak
>
> On Thu, Dec 10, 2015 at 11:45 AM, Brian London 
> wrote:
>
>> Nick's symptoms sound identical to mine.  I should mention that I just
>> pulled the latest version from github and it seems to be working there.  To
>> reproduce:
>>
>>
>>1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
>>2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
>>clean package
>>3. build/mvn -Pkinesis-asl -DskipTests clean package
>>4. Then run simultaneously:
>>1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
>>   [Kinesis stream name] [endpoint URL]
>>   2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis
>>   stream name] [endpoint URL] 100 10
>>
>>
>> On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Nick,
>>>
>>> Just to be sure: don't you see some ClassCastException in the log ?
>>>
>>> Thanks,
>>> Regards
>>> JB
>>>
>>> On 12/10/2015 07:56 PM, Nick Pentreath wrote:
>>> > Could you provide an example / test case and more detail on what issue
>>> > you're facing?
>>> >
>>> > I've just tested a simple program reading from a dev Kinesis stream and
>>> > using stream.print() to show the records, and it works under 1.5.1 but
>>> > doesn't appear to be working under 1.5.2.
>>> >
>>> > UI for 1.5.2:
>>> >
>>> > Inline image 1
>>> >
>>> > UI for 1.5.1:
>>> >
>>> > Inline image 2
>>> >
>>> > On Thu, Dec 10, 2015 at 5:50 PM, Brian London >> > > wrote:
>>> >
>>> > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
>>> > Kinesis ASL that ships with 1.5.2 appears to not work for me
>>> > although 1.5.1 is fine. I spent some time with Amazon earlier in
>>> the
>>> > week and the only thing we could do to make it work is to change
>>> the
>>> > version to 1.5.1.  Can someone please attempt to reproduce before I
>>> > open a JIRA issue for it?
>>> >
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there was some dependency conflicts, and
it is super hard to debug.
It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but
it is 1.2.1 in Spark 1.5.1.
I feel like that seems to be the problem...

Brian, did you verify that it works with the 1.6.0 branch?

Thanks,
Burak

On Thu, Dec 10, 2015 at 11:45 AM, Brian London 
wrote:

> Nick's symptoms sound identical to mine.  I should mention that I just
> pulled the latest version from github and it seems to be working there.  To
> reproduce:
>
>
>1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
>2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
>clean package
>3. build/mvn -Pkinesis-asl -DskipTests clean package
>4. Then run simultaneously:
>1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
>   [Kinesis stream name] [endpoint URL]
>   2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis
>   stream name] [endpoint URL] 100 10
>
>
> On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Nick,
>>
>> Just to be sure: don't you see some ClassCastException in the log ?
>>
>> Thanks,
>> Regards
>> JB
>>
>> On 12/10/2015 07:56 PM, Nick Pentreath wrote:
>> > Could you provide an example / test case and more detail on what issue
>> > you're facing?
>> >
>> > I've just tested a simple program reading from a dev Kinesis stream and
>> > using stream.print() to show the records, and it works under 1.5.1 but
>> > doesn't appear to be working under 1.5.2.
>> >
>> > UI for 1.5.2:
>> >
>> > Inline image 1
>> >
>> > UI for 1.5.1:
>> >
>> > Inline image 2
>> >
>> > On Thu, Dec 10, 2015 at 5:50 PM, Brian London > > > wrote:
>> >
>> > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
>> > Kinesis ASL that ships with 1.5.2 appears to not work for me
>> > although 1.5.1 is fine. I spent some time with Amazon earlier in the
>> > week and the only thing we could do to make it work is to change the
>> > version to 1.5.1.  Can someone please attempt to reproduce before I
>> > open a JIRA issue for it?
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yup also works for me on master branch as I've been testing DynamoDB Streams 
integration. In fact works with latest KCL 1.6.1 also which I was using.




So the KCL version does seem like it could be the issue - somewhere along the
line an exception must be getting swallowed. Though the tests should have
picked this up? Will dig deeper.




—
Sent from Mailbox

On Thu, Dec 10, 2015 at 11:07 PM, Brian London 
wrote:

> Yes, it worked in the 1.6 branch as of commit
> db5165246f2888537dd0f3d4c5a515875c7358ed.  That makes it much less serious
> of an issue, although it would be nice to know what the root cause is to
> avoid a regression.
> On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz  wrote:
>> I've noticed this happening when there was some dependency conflicts, and
>> it is super hard to debug.
>> It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0,
>> but it is 1.2.1 in Spark 1.5.1.
>> I feel like that seems to be the problem...
>>
>> Brian, did you verify that it works with the 1.6.0 branch?
>>
>> Thanks,
>> Burak
>>
>> On Thu, Dec 10, 2015 at 11:45 AM, Brian London 
>> wrote:
>>
>>> Nick's symptoms sound identical to mine.  I should mention that I just
>>> pulled the latest version from github and it seems to be working there.  To
>>> reproduce:
>>>
>>>
>>>1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
>>>2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
>>>clean package
>>>3. build/mvn -Pkinesis-asl -DskipTests clean package
>>>4. Then run simultaneously:
>>>1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
>>>   [Kinesis stream name] [endpoint URL]
>>>   2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis
>>>   stream name] [endpoint URL] 100 10
>>>
>>>
>>> On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi Nick,

 Just to be sure: don't you see some ClassCastException in the log ?

 Thanks,
 Regards
 JB

 On 12/10/2015 07:56 PM, Nick Pentreath wrote:
 > Could you provide an example / test case and more detail on what issue
 > you're facing?
 >
 > I've just tested a simple program reading from a dev Kinesis stream and
 > using stream.print() to show the records, and it works under 1.5.1 but
 > doesn't appear to be working under 1.5.2.
 >
 > UI for 1.5.2:
 >
 > Inline image 1
 >
 > UI for 1.5.1:
 >
 > Inline image 2
 >
 > On Thu, Dec 10, 2015 at 5:50 PM, Brian London  > wrote:
 >
 > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
 > Kinesis ASL that ships with 1.5.2 appears to not work for me
 > although 1.5.1 is fine. I spent some time with Amazon earlier in
 the
 > week and the only thing we could do to make it work is to change
 the
 > version to 1.5.1.  Can someone please attempt to reproduce before I
 > open a JIRA issue for it?
 >
 >

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I don't think the Kinesis tests specifically ran when that was merged into
1.5.2 :(
https://github.com/apache/spark/pull/8957
https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3

AFAIK pom changes don't trigger the Kinesis tests.

Burak

On Thu, Dec 10, 2015 at 8:09 PM, Nick Pentreath 
wrote:

> Yup also works for me on master branch as I've been testing DynamoDB
> Streams integration. In fact works with latest KCL 1.6.1 also which I was
> using.
>
> So the KCL version does seem like it could be the issue - somewhere along
> the line an exception must be getting swallowed. Though the tests should
> have picked this up? Will dig deeper.
>
> —
> Sent from Mailbox 
>
>
> On Thu, Dec 10, 2015 at 11:07 PM, Brian London 
> wrote:
>
>> Yes, it worked in the 1.6 branch as of commit
>> db5165246f2888537dd0f3d4c5a515875c7358ed.  That makes it much less
>> serious of an issue, although it would be nice to know what the root cause
>> is to avoid a regression.
>>
>> On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz  wrote:
>>
>>> I've noticed this happening when there was some dependency conflicts,
>>> and it is super hard to debug.
>>> It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0,
>>> but it is 1.2.1 in Spark 1.5.1.
>>> I feel like that seems to be the problem...
>>>
>>> Brian, did you verify that it works with the 1.6.0 branch?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Thu, Dec 10, 2015 at 11:45 AM, Brian London 
>>> wrote:
>>>
 Nick's symptoms sound identical to mine.  I should mention that I just
 pulled the latest version from github and it seems to be working there.  To
 reproduce:


1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
clean package
3. build/mvn -Pkinesis-asl -DskipTests clean package
4. Then run simultaneously:
1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
   [Kinesis stream name] [endpoint URL]
   2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis
   stream name] [endpoint URL] 100 10


 On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
 wrote:

> Hi Nick,
>
> Just to be sure: don't you see some ClassCastException in the log ?
>
> Thanks,
> Regards
> JB
>
> On 12/10/2015 07:56 PM, Nick Pentreath wrote:
> > Could you provide an example / test case and more detail on what
> issue
> > you're facing?
> >
> > I've just tested a simple program reading from a dev Kinesis stream
> and
> > using stream.print() to show the records, and it works under 1.5.1
> but
> > doesn't appear to be working under 1.5.2.
> >
> > UI for 1.5.2:
> >
> > Inline image 1
> >
> > UI for 1.5.1:
> >
> > Inline image 2
> >
> > On Thu, Dec 10, 2015 at 5:50 PM, Brian London <
> brianmlon...@gmail.com
> > > wrote:
> >
> > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
> > Kinesis ASL that ships with 1.5.2 appears to not work for me
> > although 1.5.1 is fine. I spent some time with Amazon earlier in
> the
> > week and the only thing we could do to make it work is to change
> the
> > version to 1.5.1.  Can someone please attempt to reproduce
> before I
> > open a JIRA issue for it?
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>>>
>


Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The Kinesis ASL
that ships with 1.5.2 appears to not work for me although 1.5.1 is fine. I
spent some time with Amazon earlier in the week and the only thing we could
do to make it work is to change the version to 1.5.1.  Can someone please
attempt to reproduce before I open a JIRA issue for it?


Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yeah also the integration tests need to be specifically run - I would have 
thought the contributor would have run those tests and also tested the change 
themselves using live Kinesis :(



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz  wrote:

> I don't think the Kinesis tests specifically ran when that was merged into
> 1.5.2 :(
> https://github.com/apache/spark/pull/8957
> https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3
> AFAIK pom changes don't trigger the Kinesis tests.
> Burak
> On Thu, Dec 10, 2015 at 8:09 PM, Nick Pentreath 
> wrote:
>> Yup also works for me on master branch as I've been testing DynamoDB
>> Streams integration. In fact works with latest KCL 1.6.1 also which I was
>> using.
>>
>> So the KCL version does seem like it could be the issue - somewhere along
>> the line an exception must be getting swallowed. Though the tests should
>> have picked this up? Will dig deeper.
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Thu, Dec 10, 2015 at 11:07 PM, Brian London 
>> wrote:
>>
>>> Yes, it worked in the 1.6 branch as of commit
>>> db5165246f2888537dd0f3d4c5a515875c7358ed.  That makes it much less
>>> serious of an issue, although it would be nice to know what the root cause
>>> is to avoid a regression.
>>>
>>> On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz  wrote:
>>>
 I've noticed this happening when there was some dependency conflicts,
 and it is super hard to debug.
 It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0,
 but it is 1.2.1 in Spark 1.5.1.
 I feel like that seems to be the problem...

 Brian, did you verify that it works with the 1.6.0 branch?

 Thanks,
 Burak

 On Thu, Dec 10, 2015 at 11:45 AM, Brian London 
 wrote:

> Nick's symptoms sound identical to mine.  I should mention that I just
> pulled the latest version from github and it seems to be working there.  
> To
> reproduce:
>
>
>1. Download spark 1.5.2 from http://spark.apache.org/downloads.html
>2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
>clean package
>3. build/mvn -Pkinesis-asl -DskipTests clean package
>4. Then run simultaneously:
>1. bin/run-example streaming.KinesisWordCountASL [Kinesis app name]
>   [Kinesis stream name] [endpoint URL]
>   2.   bin/run-example streaming.KinesisWordProducerASL [Kinesis
>   stream name] [endpoint URL] 100 10
>
>
> On Thu, Dec 10, 2015 at 2:05 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Nick,
>>
>> Just to be sure: don't you see some ClassCastException in the log ?
>>
>> Thanks,
>> Regards
>> JB
>>
>> On 12/10/2015 07:56 PM, Nick Pentreath wrote:
>> > Could you provide an example / test case and more detail on what
>> issue
>> > you're facing?
>> >
>> > I've just tested a simple program reading from a dev Kinesis stream
>> and
>> > using stream.print() to show the records, and it works under 1.5.1
>> but
>> > doesn't appear to be working under 1.5.2.
>> >
>> > UI for 1.5.2:
>> >
>> > Inline image 1
>> >
>> > UI for 1.5.1:
>> >
>> > Inline image 2
>> >
>> > On Thu, Dec 10, 2015 at 5:50 PM, Brian London <
>> brianmlon...@gmail.com
>> > > wrote:
>> >
>> > Has anyone managed to run the Kinesis demo in Spark 1.5.2?  The
>> > Kinesis ASL that ships with 1.5.2 appears to not work for me
>> > although 1.5.1 is fine. I spent some time with Amazon earlier in
>> the
>> > week and the only thing we could do to make it work is to change
>> the
>> > version to 1.5.1.  Can someone please attempt to reproduce
>> before I
>> > open a JIRA issue for it?
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>

>>

Re: Having problem with Spark streaming with Kinesis

2014-12-19 Thread Ashrafuzzaman
Thanks Aniket, that clears a lot of confusion.
On Dec 14, 2014 7:11 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:

 The reason is because of the following code:

 val numStreams = numShards
 val kinesisStreams = (0 until numStreams).map { i =>
   KinesisUtils.createStream(ssc, streamName, endpointUrl,
     kinesisCheckpointInterval,
     InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
 }

 In the above code, numStreams is set to numShards, which enforces the need
 to have #shards + 1 workers. If you instead set numStreams to
 Math.min(numShards, numAvailableWorkers - 1), you can have fewer workers
 than shards. Makes sense?

 On Sun Dec 14 2014 at 10:06:36 A.K.M. Ashrafuzzaman 
 ashrafuzzaman...@gmail.com wrote:

 Thanks Aniket,
 The trick is to have #workers = #shards + 1, but I don’t know why that is.
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

 Here in the figure[spark streaming kinesis architecture], it seems like
 one node should be able to take on more than one shards.


 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com/

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources

 On Nov 26, 2014, at 6:23 PM, A.K.M. Ashrafuzzaman 
 ashrafuzzaman...@gmail.com wrote:

 Hi guys,
 When we are using Kinesis with 1 shard it works fine. But when we
 use more than 1, it falls into an infinite loop and no data is
 processed by the spark streaming. In the Kinesis DynamoDB table, I can see
 that it keeps increasing the leaseCounter, but it does not start processing.

 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0

 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com/

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources





Re: Having problem with Spark streaming with Kinesis

2014-12-14 Thread Aniket Bhatnagar
The reason is because of the following code:

val numStreams = numShards
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl,
    kinesisCheckpointInterval,
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}

In the above code, numStreams is set to numShards, which enforces the need
to have #shards + 1 workers. If you instead set numStreams to
Math.min(numShards, numAvailableWorkers - 1), you can have fewer workers
than shards. Makes sense?
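
The suggested cap can be sketched as a small helper (hypothetical name, plain Scala, no Spark dependency):

```scala
// Sketch of the suggestion above: create at most (availableWorkers - 1)
// receivers, so at least one worker core is left free to process the
// received batches instead of being consumed by Kinesis receivers.
def receiverCount(numShards: Int, numAvailableWorkers: Int): Int =
  math.min(numShards, math.max(1, numAvailableWorkers - 1))
```

The `math.max(1, ...)` guard is an added assumption so the stream count never drops to zero on a one-worker cluster.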

On Sun Dec 14 2014 at 10:06:36 A.K.M. Ashrafuzzaman 
ashrafuzzaman...@gmail.com wrote:

 Thanks Aniket,
 The trick is to have #workers = #shards + 1, but I don’t know why that is.
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

 Here in the figure[spark streaming kinesis architecture], it seems like
 one node should be able to take on more than one shards.


 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com/

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources

 On Nov 26, 2014, at 6:23 PM, A.K.M. Ashrafuzzaman 
 ashrafuzzaman...@gmail.com wrote:

 Hi guys,
 When we are using Kinesis with 1 shard it works fine. But when we use
 more than 1, it falls into an infinite loop and no data is processed by
 the spark streaming. In the Kinesis DynamoDB table, I can see that it keeps
 increasing the leaseCounter, but it does not start processing.

 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0

 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com/

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources





Re: Having problem with Spark streaming with Kinesis

2014-12-13 Thread A.K.M. Ashrafuzzaman
Thanks Aniket,
The trick is to have #workers = #shards + 1, but I don’t know why that is.
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

Here in the figure[spark streaming kinesis architecture], it seems like one 
node should be able to take on more than one shards.


A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred

(M) 880-175-5592433
Twitter | Blog | Facebook

Check out The Academy, your #1 source
for free content marketing resources

On Nov 26, 2014, at 6:23 PM, A.K.M. Ashrafuzzaman ashrafuzzaman...@gmail.com 
wrote:

 Hi guys,
 When we are using Kinesis with 1 shard it works fine. But when we use
 more than 1, it falls into an infinite loop and no data is processed by
 the spark streaming. In the Kinesis DynamoDB table, I can see that it keeps
 increasing the leaseCounter, but it does not start processing.
 
 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0
 
 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred
 
 (M) 880-175-5592433
 Twitter | Blog | Facebook
 
 Check out The Academy, your #1 source
 for free content marketing resources
 



Re: Having problem with Spark streaming with Kinesis

2014-12-03 Thread A.K.M. Ashrafuzzaman
Guys,
On my local machine it consumes a Kinesis stream with 3 shards, but in EC2
it does not consume from the stream. Later we found that the EC2 machine had
2 cores while my local machine had 4 cores. I am using a single machine in
Spark standalone mode. We then got a larger machine from EC2 and now the
Kinesis stream is getting consumed.

4 cores Single machine - works
2 cores Single machine - does not work
2 cores 2 workers - does not work

So my question is that do we need a cluster of (#KinesisShards + 1) workers to 
be able to consume from Kinesis?


A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred

(M) 880-175-5592433
Twitter | Blog | Facebook

Check out The Academy, your #1 source
for free content marketing resources

On Nov 27, 2014, at 10:28 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com 
wrote:

 Did you set spark master as local[*]? If so, then it means that the number of
 executors is equal to the number of cores of the machine. Perhaps your mac
 machine has more cores (certainly more than the number of Kinesis shards + 1).
 
 Try explicitly setting master as local[N] where N is number of kinesis shards 
 + 1. It should then work on both the machines.
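
That sizing rule can be sketched as a helper (hypothetical name, plain Scala; assumes one core per shard receiver plus one core for processing):

```scala
// Derive a local-mode master string from the Kinesis shard count,
// reserving one extra core beyond the per-shard receivers so batch
// processing is not starved.
def localMasterFor(numShards: Int): String = s"local[${numShards + 1}]"
```

The resulting string would be passed to `SparkConf.setMaster` when building the streaming context.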
 
 On Thu, Nov 27, 2014, 9:46 AM Ashrafuzzaman ashrafuzzaman...@gmail.com 
 wrote:
 I was trying in one machine with just sbt run.
 
 And it is working with my mac environment with the same configuration.
 
 I used the sample code from 
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 
 
 val kinesisClient = new AmazonKinesisClient(new
 DefaultAWSCredentialsProviderChain())
 kinesisClient.setEndpoint(endpointUrl)
 val numShards = kinesisClient.describeStream(streamName)
   .getStreamDescription().getShards().size()
 
 /* In this example, we're going to create 1 Kinesis Worker/Receiver/DStream
 for each shard. */
 val numStreams = numShards
 
 /* Set up the SparkConfig and StreamingContext */
 /* Spark Streaming batch interval */
 val batchInterval = Milliseconds(2000)
 val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
 val ssc = new StreamingContext(sparkConfig, batchInterval)
 
 /* Kinesis checkpoint interval.  Same as batchInterval for this example. */
 val kinesisCheckpointInterval = batchInterval
 
 /* Create the same number of Kinesis DStreams/Receivers as Kinesis stream's
 shards */
 val kinesisStreams = (0 until numStreams).map { i =>
   KinesisUtils.createStream(ssc, streamName, endpointUrl,
     kinesisCheckpointInterval,
     InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
 }
 
 /* Union all the streams */
 val unionStreams = ssc.union(kinesisStreams)
 
 /* Convert each line of Array[Byte] to String, split into words, and count
 them */
 val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
 
 /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
 val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
 
 /* Print the first 10 wordCounts */
 wordCounts.print()
 
 /* Start the streaming context and await termination */
 ssc.start()
 ssc.awaitTermination()
 
 
 
 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred
 
 (M) 880-175-5592433
 Twitter | Blog | Facebook
 
 Check out The Academy, your #1 source
 for free content marketing resources
 
 On Wed, Nov 26, 2014 at 11:26 PM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:
 What's your cluster size? For streamig to work, it needs shards + 1 executors.
 
 On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman 
 ashrafuzzaman...@gmail.com wrote:
 Hi guys,
 When we are using Kinesis with 1 shard it works fine. But when we use
 more than 1, it falls into an infinite loop and no data is processed by
 the spark streaming. In the Kinesis DynamoDB table, I can see that it keeps
 increasing the leaseCounter, but it does not start processing.
 
 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0
 
 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred
 
 (M) 880-175-5592433
 Twitter | Blog | Facebook
 
 Check out The Academy, your #1 source
 for free content marketing resources
 
 



Having problem with Spark streaming with Kinesis

2014-11-26 Thread A.K.M. Ashrafuzzaman
Hi guys,
When we are using Kinesis with 1 shard it works fine. But when we use more
than 1, it falls into an infinite loop and no data is processed by the
spark streaming. In the Kinesis DynamoDB table, I can see that it keeps
increasing the leaseCounter, but it does not start processing.

I am using,
scala: 2.10.4
java version: 1.8.0_25
Spark: 1.1.0
spark-streaming-kinesis-asl: 1.1.0

A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred

(M) 880-175-5592433
Twitter | Blog | Facebook

Check out The Academy, your #1 source
for free content marketing resources



Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Akhil Das
I have it working without any issues (tried with 5 shards), except my java
version was 1.7.

Here's the piece of code that I used.

System.setProperty("AWS_ACCESS_KEY_ID", this.kConf.getOrElse("access_key", ""))
System.setProperty("AWS_SECRET_KEY", this.kConf.getOrElse("secret", ""))

val streamName = this.kConf.getOrElse("stream", "")
val endpointUrl =
  this.kConf.getOrElse("end_point", "https://kinesis.us-east-1.amazonaws.com/")

val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
kinesisClient.setEndpoint(endpointUrl)

val numShards = kinesisClient.describeStream(streamName)
  .getStreamDescription().getShards().size()
val numStreams = numShards

val kinesisCheckpointInterval = Seconds(this.kConf.getOrElse("duration", "").toInt)

val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}

/* Union all the streams */
val unionStreams = ssc.union(kinesisStreams)

val tmp_stream = unionStreams.map(byteArray => new String(byteArray))

tmp_stream.print()




Thanks
Best Regards

On Wed, Nov 26, 2014 at 5:53 PM, A.K.M. Ashrafuzzaman 
ashrafuzzaman...@gmail.com wrote:

 Hi guys,
 When we are using Kinesis with 1 shard then it works fine. But when we use
 more than 1, it falls into an infinite loop and no data is processed by
 Spark Streaming. In the Kinesis DynamoDB lease table, I can see that it keeps
 increasing the leaseCounter, but it does not start processing.

 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0





Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
What's your cluster size? For streaming to work, it needs shards + 1
executors.

On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman 
ashrafuzzaman...@gmail.com wrote:

 Hi guys,
 When we are using Kinesis with 1 shard then it works fine. But when we use
 more than 1, it falls into an infinite loop and no data is processed by
 Spark Streaming. In the Kinesis DynamoDB lease table, I can see that it keeps
 increasing the leaseCounter, but it does not start processing.

 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0





Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
Did you set the Spark master as local[*]? If so, the number of
executors is equal to the number of cores on the machine. Perhaps your Mac
has more cores (certainly more than the number of Kinesis shards + 1).

Try explicitly setting the master as local[N], where N is the number of
Kinesis shards + 1. It should then work on both machines.
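For reference, a minimal sketch of that suggestion (the shard count here is
illustrative; in practice numShards would come from describeStream, as in the
code quoted below):

```scala
import org.apache.spark.SparkConf

// One core per Kinesis receiver (one receiver per shard), plus at least
// one core left over to actually process the received batches.
val numShards = 2                               // illustrative value
val master    = s"local[${numShards + 1}]"

val sparkConfig = new SparkConf()
  .setAppName("KinesisWordCount")
  .setMaster(master)                            // explicit, rather than local[*]
```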

On Thu, Nov 27, 2014, 9:46 AM Ashrafuzzaman ashrafuzzaman...@gmail.com
wrote:

 I was trying in one machine with just sbt run.

 And it is working with my mac environment with the same configuration.

 I used the sample code from
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala


 val kinesisClient = new AmazonKinesisClient(new
 DefaultAWSCredentialsProviderChain())
 kinesisClient.setEndpoint(endpointUrl)
 val numShards =
 kinesisClient.describeStream(streamName).getStreamDescription().getShards()
   .size()

 /* In this example, we're going to create 1 Kinesis
 Worker/Receiver/DStream for each shard. */
 val numStreams = numShards

 /* Set up the SparkConf and StreamingContext */
 /* Spark Streaming batch interval */
 val batchInterval = Milliseconds(2000)
 val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
 val ssc = new StreamingContext(sparkConfig, batchInterval)

 /* Kinesis checkpoint interval.  Same as batchInterval for this example. */
 val kinesisCheckpointInterval = batchInterval

 /* Create the same number of Kinesis DStreams/Receivers as Kinesis
 stream's shards */
 val kinesisStreams = (0 until numStreams).map { i =>
   KinesisUtils.createStream(ssc, streamName, endpointUrl,
 kinesisCheckpointInterval,
   InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
 }

 /* Union all the streams */
 val unionStreams = ssc.union(kinesisStreams)

 /* Convert each line of Array[Byte] to String, split into words, and count
 them */
 val words = unionStreams.flatMap(byteArray => new String(byteArray)
   .split(" "))

 /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
 val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

 /* Print the first 10 wordCounts */
 wordCounts.print()

 /* Start the streaming context and await termination */
 ssc.start()
 ssc.awaitTermination()



 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources

 On Wed, Nov 26, 2014 at 11:26 PM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 What's your cluster size? For streaming to work, it needs shards + 1
 executors.

 On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman 
 ashrafuzzaman...@gmail.com wrote:

 Hi guys,
 When we are using Kinesis with 1 shard then it works fine. But when we
 use more than 1, it falls into an infinite loop and no data is processed
 by Spark Streaming. In the Kinesis DynamoDB lease table, I can see that
 it keeps increasing the leaseCounter, but it does not start processing.

 I am using,
 scala: 2.10.4
 java version: 1.8.0_25
 Spark: 1.1.0
 spark-streaming-kinesis-asl: 1.1.0






Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi all,

I followed the guide here:
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

But got this error:
Exception in thread main java.lang.NoClassDefFoundError:
com/amazonaws/auth/AWSCredentialsProvider

Would you happen to know what dependency or jar is needed ?

Harold
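
For reference, com.amazonaws.auth.AWSCredentialsProvider ships with the AWS
Java SDK, which the Kinesis ASL module pulls in transitively. A minimal
build.sbt sketch (module names match Spark 1.1.0; treat the exact versions as
assumptions for your own build):

```scala
// build.sbt sketch -- versions are assumptions; align them with your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  // Bundles the Kinesis receiver and brings in the AWS Java SDK
  // (com.amazonaws.auth.AWSCredentialsProvider) transitively.
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0"
)
```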


Re: Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi again,

After getting through several dependencies, I finally got to this
non-dependency type error:

Exception in thread main java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V

It looks very similar to this post:

http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi

Since I'm a little new to everything, would someone be able to provide
step-by-step guidance for that?

Harold
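
For reference, this NoSuchMethodError is typically a classpath conflict: the
AWS SDK expects a newer Apache httpclient than the one other Spark
dependencies drag in. One way to force a consistent version in sbt, sketched
below (the versions are assumptions; check what your AWS SDK release actually
requires):

```scala
// build.sbt sketch -- pin httpclient/httpcore so the AWS SDK's newer
// DefaultClientConnectionOperator constructor is actually on the classpath.
// Versions are assumptions; match them to your AWS SDK release.
dependencyOverrides ++= Set(
  "org.apache.httpcomponents" % "httpclient" % "4.3.6",
  "org.apache.httpcomponents" % "httpcore"   % "4.3.3"
)
```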

On Wed, Oct 29, 2014 at 9:22 AM, Harold Nguyen har...@nexgate.com wrote:

 Hi all,

 I followed the guide here:
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

 But got this error:
 Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider

 Would you happen to know what dependency or jar is needed ?

 Harold




Re: Spark Streaming with Kinesis

2014-10-29 Thread Matt Chu
I haven't tried this myself yet, but this sounds relevant:

https://github.com/apache/spark/pull/2535

Will be giving this a try today or so, will report back.

On Wednesday, October 29, 2014, Harold Nguyen har...@nexgate.com wrote:

 Hi again,

 After getting through several dependencies, I finally got to this
 non-dependency type error:

 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V

 It looks very similar to this post:


 http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi

 Since I'm a little new to everything, would someone be able to provide
 step-by-step guidance for that?

 Harold

 On Wed, Oct 29, 2014 at 9:22 AM, Harold Nguyen har...@nexgate.com wrote:

 Hi all,

 I followed the guide here:
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

 But got this error:
 Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider

 Would you happen to know what dependency or jar is needed ?

 Harold