Data ingestion into Elasticsearch failing using PySpark

2024-03-11 Thread Karthick Nk
Hi @all,

I am using a PySpark program to write data into an Elasticsearch index using
the upsert operation (sample code snippet below).

def writeDataToES(final_df):
    # Connection and write options for the ES-Hadoop (elasticsearch-spark) connector
    write_options = {
        "es.nodes": elastic_host,
        "es.port": elastic_port,
        "es.net.ssl": "true",
        "es.nodes.wan.only": "true",
        "es.net.http.auth.user": elastic_user_name,
        "es.net.http.auth.pass": elastic_password,
        "es.spark.dataframe.write.null": "true",
        "es.mapping.id": mapping_id,
        "es.write.operation": "upsert"
    }
    final_df.write.format("org.elasticsearch.spark.sql") \
        .options(**write_options) \
        .mode("append") \
        .save(f"{index_name}")


While writing data from the Delta table to the Elasticsearch index, I am
getting an error for a few records (error message below):

Py4JJavaError: An error occurred while calling o1305.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
4 in stage 524.0 failed 4 times, most recent failure: Lost task 4.3 in
stage 524.0 (TID 12805) (192.168.128.16 executor 1):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for
bulk operation [1/1]. Error sample (first [5] error messages):
 org.elasticsearch.hadoop.rest.EsHadoopRemoteException:
illegal_argument_exception: Illegal group reference: group index is missing

Could you guide me on this? Am I missing anything?

If you require any additional details, please let me know.

Thanks


Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh
Hi Ashok,

Thanks for pointing out the Databricks article Scalable Spark Structured
Streaming for REST API Destinations | Databricks Blog
<https://www.databricks.com/blog/scalable-spark-structured-streaming-rest-api-destinations>

I browsed it, and it is broadly similar to what many of us do with Spark
Structured Streaming and *foreachBatch*. This article and mine both mention
a REST API as part of the architecture. However, there are notable
differences, I believe.

In my proposed approach (a sketch follows the list below):

   1. Event-Driven Model:

   - Spark Streaming waits until the Flask REST API makes a request for events
   to be generated within PySpark.
   - Messages are generated and then fed into any sink based on the Flask
   REST API's request.
   - This creates a more event-driven model where Spark generates data when
   prompted by external requests.
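
As an illustration only, here is a minimal PySpark plus Flask sketch of this
model. The /generate endpoint, row count, generated columns and Parquet sink
are hypothetical placeholders rather than the actual implementation:

from flask import Flask, jsonify, request
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app = Flask(__name__)
spark = SparkSession.builder.appName("event_driven_demo").getOrCreate()

@app.route("/generate", methods=["POST"])
def generate():
    # Spark only generates data when an external REST request arrives
    payload = request.get_json(silent=True) or {}
    n = int(payload.get("rows", 100))
    df = (spark.range(n)
          .withColumn("event_ts", F.current_timestamp())
          .withColumn("value", F.rand()))
    # Feed the generated batch into a sink of choice (Parquet here for brevity)
    df.write.mode("append").parquet("/tmp/generated_events")
    return jsonify({"rows_written": n})

Running Flask and the SparkSession in the same driver process, as above, is
the simplest arrangement for a demo of this kind.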





In the Databricks article scenario:

Continuous Data Stream:

   - There is an incoming stream of data from sources like Kafka, AWS
   Kinesis, or Azure Event Hub handled by foreachBatch
   - As messages flow off this stream, calls are made to a REST API with
   some or all of the message data.
   - This suggests a continuous flow of data where messages are sent to a
   REST API as soon as they are available in the streaming source (see the
   sketch below).

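For contrast, a minimal foreachBatch sketch along the lines the article
describes, assuming a Kafka source with the Kafka connector package on the
classpath; the broker address, topic name and REST endpoint are illustrative
placeholders:

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch_rest_demo").getOrCreate()

def post_to_rest(batch_df, batch_id):
    # Called once per micro-batch: send some or all of the message data to a REST API.
    # For simplicity the rows are pulled to the driver; a real job would post from the executors.
    for row in batch_df.select("value").toLocalIterator():
        requests.post("https://example.com/ingest", json={"payload": row["value"]}, timeout=10)

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

query = stream.writeStream.foreachBatch(post_to_rest).start()
query.awaitTermination()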

*Benefits of Event-Driven Model:*


   1. Responsiveness: Ideal for scenarios where data generation needs to be
   aligned with specific events or user actions.
   2. Resource Optimization: Can reduce resource consumption by processing
   data only when needed.
   3. Flexibility: Allows for dynamic control over data generation based on
   external triggers.

*Benefits of Continuous Data Stream Mode with foreachBatch:*

   1. Real-Time Processing: Facilitates immediate analysis and action on
   incoming data.
   2. Handling High Volumes: Well-suited for scenarios with
   continuous, high-volume data streams.
   3. Low-Latency Applications: Essential for applications requiring near
   real-time responses.

*Potential Use Cases for my approach:*

   - On-Demand Data Generation: Generating data for
   simulations, reports, or visualizations based on user requests.
   - Triggered Analytics: Executing specific analytics tasks only when
   certain events occur, such as detecting anomalies or reaching thresholds
   (for example, fraud detection).
   - Custom ETL Processes: Facilitating data extraction, transformation, and
   loading workflows based on external events or triggers.


One thing to note is latency: event-driven models like mine can potentially
introduce slight latency compared to continuous processing, as data
generation depends on API calls.

So my approach is more event-triggered and responsive to external requests,
while the foreachBatch scenario is more continuous and real-time, processing
and sending data as it becomes available.

In summary, both approaches have their merits and are suited to different
use cases depending on the nature of the data flow and processing
requirements.

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Jan 2024 at 19:11, ashok34...@yahoo.com 
wrote:

> Hey Mich,
>
> Thanks for this introduction on your forthcoming proposal "Spark
> Structured Streaming and Flask REST API for Real-Time Data Ingestion and
> Analytics". I recently came across an article by Databricks with title 
> Scalable
> Spark Structured Streaming for REST API Destinations
> <https://www.databricks.com/blog/scalable-spark-structured-streaming-rest-api-destinations>
> . Their use case is similar to your suggestion but what they are saying
> is that they have incoming stream of data from sources like Kafka, AWS
> Kinesis, or Azure Event Hub. In other words, a continuous flow of data
> where messages are sent to a REST API as soon as they are available in the
> streaming source. Their approach is practical but wanted to get your
> thoughts on their article with a better understanding on your proposal and
> differences.
>
> Thanks
>
>
> On Tuesday, 9 January 2024 at 00:24:19 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Please also note that Flask, by default, is a single-threaded web
> framework. While it is suitable for development and small-scale
> applications, it may not handle concurrent requests efficiently in a
> production environment.
> In production, one can utilise G

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID
 Hey Mich,
Thanks for this introduction on your forthcoming proposal "Spark Structured 
Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I 
recently came across an article by Databricks with title Scalable Spark 
Structured Streaming for REST API Destinations. Their use case is similar to 
your suggestion but what they are saying is that they have incoming stream of 
data from sources like Kafka, AWS Kinesis, or Azure Event Hub. In other words, 
a continuous flow of data where messages are sent to a REST API as soon as they 
are available in the streaming source. Their approach is practical but wanted 
to get your thoughts on their article with a better understanding on your 
proposal and differences.
Thanks

On Tuesday, 9 January 2024 at 00:24:19 GMT, Mich Talebzadeh 
 wrote:  
 
 Please also note that Flask, by default, is a single-threaded web framework. 
While it is suitable for development and small-scale applications, it may not 
handle concurrent requests efficiently in a production environment. In 
production, one can utilise Gunicorn (Green Unicorn) which is a WSGI ( Web 
Server Gateway Interface) that is commonly used to serve Flask applications in 
production. It provides multiple worker processes, each capable of handling a 
single request at a time. This makes Gunicorn suitable for handling multiple 
simultaneous requests and improves the concurrency and performance of your 
Flask application.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh  wrote:

Thought it might be useful to share my idea with fellow forum members.  During 
the breaks, I worked on the seamless integration of Spark Structured Streaming 
with Flask REST API for real-time data ingestion and analytics. The use case 
revolves around a scenario where data is generated through REST API requests in 
real time. The Flask REST API efficiently captures and processes this data, 
saving it to a Spark Structured Streaming DataFrame. Subsequently, the 
processed data could be channelled into any sink of your choice including Kafka 
pipeline, showing a robust end-to-end solution for dynamic and responsive data 
streaming. I will delve into the architecture, implementation, and benefits of 
this combination, enabling one to build an agile and efficient real-time data 
application. I will put the code in GitHub for everyone's benefit. Hopefully 
your comments will help me to improve it.
Cheers
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 

  

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Please also note that Flask, by default, is a single-threaded web
framework. While it is suitable for development and small-scale
applications, it may not handle concurrent requests efficiently in a
production environment.
In production, one can utilise Gunicorn (Green Unicorn) which is a WSGI (
Web Server Gateway Interface) that is commonly used to serve Flask
applications in production. It provides multiple worker processes, each
capable of handling a single request at a time. This makes Gunicorn
suitable for handling multiple simultaneous requests and improves the
concurrency and performance of your Flask application.
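
Purely as an illustration (the worker count, bind address and timeout below
are arbitrary placeholders, not recommendations), a minimal gunicorn.conf.py
for such a setup could look like the sketch below, launched with something
like gunicorn -c gunicorn.conf.py app:app:

# gunicorn.conf.py - minimal sketch; values are illustrative only
bind = "0.0.0.0:8000"   # address and port Gunicorn listens on
workers = 4             # number of worker processes handling requests in parallel
timeout = 120           # seconds before an unresponsive worker is killed and restarted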

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh 
wrote:

> Thought it might be useful to share my idea with fellow forum members.  During
> the breaks, I worked on the *seamless integration of Spark Structured
> Streaming with Flask REST API for real-time data ingestion and analytics*.
> The use case revolves around a scenario where data is generated through
> REST API requests in real time. The Flask REST API
> <https://en.wikipedia.org/wiki/Flask_(web_framework)> efficiently
> captures and processes this data, saving it to a Spark Structured Streaming
> DataFrame. Subsequently, the processed data could be channelled into any
> sink of your choice including Kafka pipeline, showing a robust end-to-end
> solution for dynamic and responsive data streaming. I will delve into the
> architecture, implementation, and benefits of this combination, enabling
> one to build an agile and efficient real-time data application. I will put
> the code in GitHub for everyone's benefit. Hopefully your comments will
> help me to improve it.
>
> Cheers
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Thought it might be useful to share my idea with fellow forum members.  During
the breaks, I worked on the *seamless integration of Spark Structured
Streaming with Flask REST API for real-time data ingestion and analytics*.
The use case revolves around a scenario where data is generated through
REST API requests in real time. The Flask REST API
<https://en.wikipedia.org/wiki/Flask_(web_framework)> efficiently captures
and processes this data, saving it to a Spark Structured Streaming
DataFrame. Subsequently, the processed data could be channelled into any
sink of your choice including Kafka pipeline, showing a robust end-to-end
solution for dynamic and responsive data streaming. I will delve into the
architecture, implementation, and benefits of this combination, enabling
one to build an agile and efficient real-time data application. I will put
the code in GitHub for everyone's benefit. Hopefully your comments will
help me to improve it.

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Data ingestion

2022-08-17 Thread Pasha Finkelshtein
But not in streaming, right? That would be the usual batch approach, but the
initial question was about streaming.


[image: facebook] <https://fb.com/asm0dey>
[image: twitter] <https://twitter.com/asm0di0>
[image: linkedin] <https://linkedin.com/in/asm0dey>
[image: instagram] <https://instagram.com/asm0dey>

Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more <https://jetbrains.com>



Thu, 18 Aug 2022 at 03:12, pengyh :

> from my experience, spark can read/write from/to both mysql and hive
> fluently.
>
> regards.
>
>
> Akash Vellukai wrote:
> > How we could do data ingestion from MySQL to Hive with the help of Spark
> > streaming and not with Kafka
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Data ingestion

2022-08-17 Thread pengyh
from my experience, spark can read/write from/to both mysql and hive 
fluently.


regards.


Akash Vellukai wrote:
How we could do data ingestion from MySQL to Hive with the help of Spark 
streaming and not with Kafka


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Data ingestion

2022-08-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
If you are on AWS, you can use RDS + AWS DMS to save data to S3 and then read 
the streaming data with Spark Structured Streaming from S3 into Hive.

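A rough PySpark sketch of the Spark side of that idea, assuming DMS lands CSV
files under an S3 prefix; the bucket, schema, checkpoint path and target Hive
table are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("dms_s3_to_hive")
         .enableHiveSupport()
         .getOrCreate())

# A streaming file source needs the schema supplied up front
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("updated_at", TimestampType()),
])

stream = (spark.readStream
          .schema(schema)
          .csv("s3a://my-dms-bucket/mysql/mytable/"))

# Append each micro-batch into a Hive table
query = (stream.writeStream
         .foreachBatch(lambda df, _: df.write.mode("append").saveAsTable("mydb.mytable"))
         .option("checkpointLocation", "s3a://my-dms-bucket/checkpoints/mytable/")
         .start())
query.awaitTermination()
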
Best regards

> On 17 Aug 2022, at 20:51, Akash Vellukai  wrote:
> 
> 
> Dear Sir, 
> 
> 
> How we could do data ingestion from MySQL to Hive with the help of Spark 
> streaming and not with Kafka
> 
> Thanks and regards
> Akash

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Data ingestion

2022-08-17 Thread Pasha Finkelshtein
Hello

Spark does not have any built-in solution for this problem. Most probably
you will want to use Debezium + Kafka and read from Kafka with Spark.

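To make that concrete, a hedged sketch of the Spark side, assuming Debezium
publishes MySQL change events as JSON to a Kafka topic; the brokers, topic
name and the fields modelled from the change envelope are all illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("debezium_kafka_reader").getOrCreate()

# Only a couple of fields of the Debezium change envelope are modelled here
after_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])
envelope = StructType([
    StructField("op", StringType()),      # c = insert, u = update, d = delete
    StructField("after", after_schema),   # row image after the change
])

changes = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "dbserver1.mydb.customers")
           .load()
           .select(F.from_json(F.col("value").cast("string"), envelope).alias("msg"))
           .select("msg.op", "msg.after.*"))

changes.writeStream.format("console").option("truncate", "false").start().awaitTermination()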

[image: facebook] <https://fb.com/asm0dey>
[image: twitter] <https://twitter.com/asm0di0>
[image: linkedin] <https://linkedin.com/in/asm0dey>
[image: instagram] <https://instagram.com/asm0dey>

Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more <https://jetbrains.com>



Wed, 17 Aug 2022 at 19:51, Akash Vellukai :

> Dear Sir,
>
>
> How we could do data ingestion from MySQL to Hive with the help of Spark
> streaming and not with Kafka
>
> Thanks and regards
> Akash
>


Data ingestion

2022-08-17 Thread Akash Vellukai
Dear Sir,


How could we do data ingestion from MySQL to Hive with the help of Spark
Streaming and not with Kafka?

Thanks and regards
Akash


Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Akash Vellukai
I am a beginner with Spark. May I also know how to connect a MySQL database
with Spark Streaming?

Thanks and regards
Akash P

On Wed, 17 Aug, 2022, 8:28 pm Saurabh Gulati, 
wrote:

> Another take:
>
>- Debezium
><https://debezium.io/documentation/reference/stable/connectors/mysql.html>
>to read Write Ahead logs(WAL) and send to Kafka
>- Kafka connect to write to cloud storage -> Hive
>   - OR
>
>
>- Spark streaming to parse WAL -> Storage -> Hive
>
> Regards
> --
> *From:* Gibson 
> *Sent:* 17 August 2022 16:53
> *To:* Akash Vellukai 
> *Cc:* user@spark.apache.org 
> *Subject:* [EXTERNAL] Re: Spark streaming - Data Ingestion
>
> If you have space for a message log like, then you should try:
>
> MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS
> -> Hive
>
> On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai 
> wrote:
>
> Dear sir
>
> I have tried a lot on this could you help me with this?
>
> Data ingestion from MySql to Hive with spark- streaming?
>
> Could you give me an overview.
>
>
> Thanks and regards
> Akash P
>
>


Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
The idea behind Spark Streaming is to process change events as they occur,
hence the suggestions above that require capturing change events using
Debezium.

But you can also use JDBC drivers to connect Spark to relational databases.

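For the "connect Spark to MySQL" part of the question, a minimal JDBC read
could look roughly like the sketch below; the host, database, table and
credentials are placeholders, and the MySQL JDBC driver jar must be on the
Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql_jdbc_read").getOrCreate()

# Plain batch read of a MySQL table over JDBC
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "customers")
      .option("user", "spark_user")
      .option("password", "secret")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())

df.show(5)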

On Wed, Aug 17, 2022 at 6:21 PM Akash Vellukai 
wrote:

> I am beginner with spark may , also know how to connect MySQL database
> with spark streaming
>
> Thanks and regards
> Akash P
>
> On Wed, 17 Aug, 2022, 8:28 pm Saurabh Gulati, 
> wrote:
>
>> Another take:
>>
>>- Debezium
>><https://debezium.io/documentation/reference/stable/connectors/mysql.html>
>>to read Write Ahead logs(WAL) and send to Kafka
>>- Kafka connect to write to cloud storage -> Hive
>>   - OR
>>
>>
>>- Spark streaming to parse WAL -> Storage -> Hive
>>
>> Regards
>> --
>> *From:* Gibson 
>> *Sent:* 17 August 2022 16:53
>> *To:* Akash Vellukai 
>> *Cc:* user@spark.apache.org 
>> *Subject:* [EXTERNAL] Re: Spark streaming - Data Ingestion
>>
>> If you have space for a message log like, then you should try:
>>
>> MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS
>> -> Hive
>>
>> On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai <
>> akashvellukai...@gmail.com> wrote:
>>
>> Dear sir
>>
>> I have tried a lot on this could you help me with this?
>>
>> Data ingestion from MySql to Hive with spark- streaming?
>>
>> Could you give me an overview.
>>
>>
>> Thanks and regards
>> Akash P
>>
>>


Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Saurabh Gulati
Another take:

  *   Debezium
      <https://debezium.io/documentation/reference/stable/connectors/mysql.html>
      to read Write-Ahead Logs (WAL) and send to Kafka
  *   Kafka Connect to write to cloud storage -> Hive
      *   OR
  *   Spark Streaming to parse the WAL -> Storage -> Hive

Regards

From: Gibson 
Sent: 17 August 2022 16:53
To: Akash Vellukai 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] Re: Spark streaming - Data Ingestion


If you have space for a message log like, then you should try:

MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS -> Hive

On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai 
mailto:akashvellukai...@gmail.com>> wrote:
Dear sir

I have tried a lot on this could you help me with this?

Data ingestion from MySql to Hive with spark- streaming?

Could you give me an overview.


Thanks and regards
Akash P


Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
If you have space for a message log like Kafka, then you should try:

MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS ->
Hive

On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai 
wrote:

> Dear sir
>
> I have tried a lot on this could you help me with this?
>
> Data ingestion from MySql to Hive with spark- streaming?
>
> Could you give me an overview.
>
>
> Thanks and regards
> Akash P
>


Spark streaming - Data Ingestion

2022-08-17 Thread Akash Vellukai
Dear sir

I have tried a lot on this; could you help me with it?

Data ingestion from MySQL to Hive with Spark Streaming?

Could you give me an overview?


Thanks and regards
Akash P


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
Yes, I did the same. It's working. Thanks!

On 21-Nov-2017 4:04 PM, "Fernando Pereira"  wrote:

> Did you consider do string processing to build the SQL expression which
> you can execute with spark.sql(...)?
> Some examples: https://spark.apache.org/docs/latest/sql-
> programming-guide.html#hive-tables
>
> Cheers
>
> On 21 November 2017 at 03:27, Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> Any help? PFB.
>>
>> Thanks,
>> Aakash.
>>
>> On 20-Nov-2017 6:58 PM, "Aakash Basu"  wrote:
>>
>>> Hi all,
>>>
>>> I have a table which will have 4 columns -
>>>
>>> |  Expression|filter_condition| from_clause|
>>> group_by_columns|
>>>
>>>
>>> This file may have variable number of rows depending on the no. of KPIs
>>> I need to calculate.
>>>
>>> I need to write a SparkSQL program which will have to read this file and
>>> run each line of queries dynamically by fetching each column value for a
>>> particular row and create a select query out of it and run inside a
>>> dataframe, later saving it as a temporary table.
>>>
>>> Did anyone do this kind of exercise? If yes, can I get some help on it
>>> pls?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Fernando Pereira
Did you consider doing string processing to build the SQL expression, which
you can then execute with spark.sql(...)?
Some examples:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
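
A hedged sketch of that suggestion, assuming the four-column control table
has been loaded into a DataFrame; the file path, column handling and view
names are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic_kpi_sql").enableHiveSupport().getOrCreate()

# Control table with columns: Expression, filter_condition, from_clause, group_by_columns
kpi_defs = spark.read.option("header", True).csv("/path/to/kpi_definitions.csv")

for i, row in enumerate(kpi_defs.collect()):
    # Build each SELECT statement as a string from the row's column values
    sql = f"SELECT {row['Expression']} FROM {row['from_clause']}"
    if row["filter_condition"]:
        sql += f" WHERE {row['filter_condition']}"
    if row["group_by_columns"]:
        sql += f" GROUP BY {row['group_by_columns']}"
    result = spark.sql(sql)
    result.createOrReplaceTempView(f"kpi_{i}")   # save each result as a temporary view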

Cheers

On 21 November 2017 at 03:27, Aakash Basu 
wrote:

> Hi all,
>
> Any help? PFB.
>
> Thanks,
> Aakash.
>
> On 20-Nov-2017 6:58 PM, "Aakash Basu"  wrote:
>
>> Hi all,
>>
>> I have a table which will have 4 columns -
>>
>> |  Expression|filter_condition| from_clause|
>> group_by_columns|
>>
>>
>> This file may have variable number of rows depending on the no. of KPIs I
>> need to calculate.
>>
>> I need to write a SparkSQL program which will have to read this file and
>> run each line of queries dynamically by fetching each column value for a
>> particular row and create a select query out of it and run inside a
>> dataframe, later saving it as a temporary table.
>>
>> Did anyone do this kind of exercise? If yes, can I get some help on it
>> pls?
>>
>> Thanks,
>> Aakash.
>>
>


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all,

Any help? PFB.

Thanks,
Aakash.

On 20-Nov-2017 6:58 PM, "Aakash Basu"  wrote:

> Hi all,
>
> I have a table which will have 4 columns -
>
> |  Expression|filter_condition| from_clause|
> group_by_columns|
>
>
> This file may have variable number of rows depending on the no. of KPIs I
> need to calculate.
>
> I need to write a SparkSQL program which will have to read this file and
> run each line of queries dynamically by fetching each column value for a
> particular row and create a select query out of it and run inside a
> dataframe, later saving it as a temporary table.
>
> Did anyone do this kind of exercise? If yes, can I get some help on it pls?
>
> Thanks,
> Aakash.
>


Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all,

I have a table which will have 4 columns -

| Expression | filter_condition | from_clause | group_by_columns |


This file may have a variable number of rows depending on the number of KPIs
I need to calculate.

I need to write a Spark SQL program which will read this file and run each
line of queries dynamically: fetch each column value for a particular row,
build a SELECT query out of it, run it into a DataFrame, and later save the
result as a temporary table.

Has anyone done this kind of exercise? If yes, can I get some help on it,
please?

Thanks,
Aakash.


Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Mich Talebzadeh
Hi,

If we are talking about billions of records then, depending on your network
and RDBMS, from my experience parallel connections work OK for dimension
tables of moderate size, in that you can open parallel connections to the
RDBMS (assuming the table has a primary key or unique column) to parallelise
the process and read the data "as is" into Spark over JDBC.

However, the other alternative is to get the data into HDFS using Sqoop or
even Spark.

The third option is to use bulk copy to get the data out of the RDBMS table
into a directory (CSV type), scp it to an HDFS host, put it into HDFS, and
then access it through Hive external tables etc.

A real-time load of data using Spark JDBC makes sense if the RDBMS table
itself is pretty small. Most dimension tables should satisfy this. This
approach is not advisable for FACT tables.

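A hedged illustration of what I mean by parallel JDBC connections, assuming
the dimension table has a numeric primary key column called id; the URL,
credentials, bounds and partition count are placeholders you would size to
your own table and network:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel_jdbc_read").getOrCreate()

dim_df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "dim_customer")
          .option("user", "etl_user")
          .option("password", "secret")
          # split the read into parallel connections over the primary key range
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())

print(dim_df.count())
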
HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 October 2016 at 10:35, Teng Qiu  wrote:

> Hi Ninad, i believe the purpose of jdbcRDD is to use RDBMS as an addtional
> data source during the data processing, main goal of spark is still
> analyzing data from HDFS-like file system.
>
> to use spark as a data integration tool to transfer billions of records
> from RDBMS to HDFS etc. could work, but may not be the best tool... Sqoop
> with --direct sounds better, but the configuration costs, sqoop should be
> used for regular data integration tasks.
>
> not sure if your client need transfer billions of records periodically, if
> it is only an initial load, for such an one-off task, maybe a bash script
> with COPY command is more easier and faster :)
>
> Best,
>
> Teng
>
>
> 2016-10-18 4:24 GMT+02:00 Ninad Shringarpure :
>
>>
>> Hi Team,
>>
>> One of my client teams is trying to see if they can use Spark to source
>> data from RDBMS instead of Sqoop.  Data would be substantially large in the
>> order of billions of records.
>>
>> I am not sure reading the documentations whether jdbcRDD by design is
>> going to be able to scale well for this amount of data. Plus some in-built
>> features provided in Sqoop like --direct might give better performance than
>> straight up jdbc.
>>
>> My primary question to this group is if it is advisable to use jdbcRDD
>> for data sourcing and can we expect it to scale. Also performance wise how
>> would it compare to Sqoop.
>>
>> Please let me know your thoughts and any pointers if anyone in the group
>> has already implemented it.
>>
>> Thanks,
>> Ninad
>>
>>
>


Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Teng Qiu
Hi Ninad, I believe the purpose of jdbcRDD is to use an RDBMS as an additional
data source during data processing; the main goal of Spark is still
analyzing data from an HDFS-like file system.

Using Spark as a data integration tool to transfer billions of records
from an RDBMS to HDFS etc. could work, but it may not be the best tool... Sqoop
with --direct sounds better, despite the configuration cost; Sqoop should be
used for regular data integration tasks.

I am not sure if your client needs to transfer billions of records periodically;
if it is only an initial load, then for such a one-off task maybe a bash script
with a COPY command is easier and faster :)

Best,

Teng


2016-10-18 4:24 GMT+02:00 Ninad Shringarpure :

>
> Hi Team,
>
> One of my client teams is trying to see if they can use Spark to source
> data from RDBMS instead of Sqoop.  Data would be substantially large in the
> order of billions of records.
>
> I am not sure reading the documentations whether jdbcRDD by design is
> going to be able to scale well for this amount of data. Plus some in-built
> features provided in Sqoop like --direct might give better performance than
> straight up jdbc.
>
> My primary question to this group is if it is advisable to use jdbcRDD for
> data sourcing and can we expect it to scale. Also performance wise how
> would it compare to Sqoop.
>
> Please let me know your thoughts and any pointers if anyone in the group
> has already implemented it.
>
> Thanks,
> Ninad
>
>


Fwd: jdbcRDD for data ingestion from RDBMS

2016-10-17 Thread Ninad Shringarpure
Hi Team,

One of my client teams is trying to see if they can use Spark to source
data from RDBMS instead of Sqoop.  Data would be substantially large in the
order of billions of records.

I am not sure, from reading the documentation, whether jdbcRDD by design is
going to be able to scale well for this amount of data. Plus, some built-in
features provided in Sqoop, like --direct, might give better performance than
straight-up JDBC.

My primary question to this group is whether it is advisable to use jdbcRDD
for data sourcing and whether we can expect it to scale. Also, performance-wise,
how would it compare to Sqoop?

Please let me know your thoughts and any pointers if anyone in the group
has already implemented it.

Thanks,
Ninad


Re: Schedule lunchtime today for a free webinar "IoT data ingestion in Spark Streaming using Kaa" 11 a.m. PDT (2 p.m. EDT)

2015-08-04 Thread orozvadovskyy
Hi there! 

If you missed our webinar on "IoT data ingestion in Spark with KaaIoT", see the 
video and slides here: http://goo.gl/VMyQ1M 

We recorded our webinar on “IoT data ingestion in Spark Streaming using Kaa” 
for those who couldn’t see it live or who would like to refresh what they have 
learned. During the webinar, we explained and illustrated how Kaa and Spark can 
be effectively used together to address the challenges of IoT data gathering 
and analysis. In this video, you will find highly crystallized, practical 
instruction on setting up your own stream analytics solution with Kaa and 
Spark. 

Best wishes, 
Oleh Rozvadovskyy 
CyberVision Inc 

- Original message -

From: "Oleh Rozvadovskyy"  
To: user@spark.apache.org 
Sent: Thursday, 23 July 2015, 17:48:11 
Subject: Schedule lunchtime today for a free webinar "IoT data ingestion in Spark 
Streaming using Kaa" 11 a.m. PDT (2 p.m. EDT) 

Hi there! 

Only couple of hours left to our first webinar on IoT data ingestion in Spark 
Streaming using Kaa . 



During the webinar we will build a solution that ingests real-time data from 
Intel Edison into Apache Spark for stream processing. This solution includes a 
client, middleware, and analytics components. All of these software components 
are 100% open-source, therefore, the solution described in this tutorial can be 
used as a prototype for even a commercial product. 

Those, who are interested, please feel free to sign up here . 

Best wishes, 
Oleh Rozvadovskyy 
CyberVision Inc. 

​ 



Schedule lunchtime today for a free webinar "IoT data ingestion in Spark Streaming using Kaa" 11 a.m. PDT (2 p.m. EDT)

2015-07-23 Thread Oleh Rozvadovskyy
Hi there!

Only a couple of hours left until our first webinar on *IoT data ingestion in
Spark Streaming using Kaa*.



During the webinar we will build a solution that ingests real-time data
from Intel Edison into Apache Spark for stream processing. This solution
includes a client, middleware, and analytics components. All of these
software components are 100% open-source, therefore, the solution described
in this tutorial can be used as a prototype for even a commercial product.

Those, who are interested, please feel free to sign up here
<https://goo.gl/rgWuj6>.

Best wishes,
Oleh Rozvadovskyy
CyberVision Inc.

​


Re: How to speed up data ingestion with Spark

2015-05-12 Thread Akhil Das
This article <http://www.virdata.com/tuning-spark/> gives you a pretty good
start on the Spark Streaming side. And this article
<https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines>
is for Kafka; it has a nice explanation of how message size and partitions
affect the throughput. And this article
<https://www.sigmoid.com/creating-sigview-a-real-time-analytics-dashboard/>
has a use case.

Thanks
Best Regards

On Tue, May 12, 2015 at 8:25 PM, dgoldenberg 
wrote:

> Hi,
>
> I'm looking at a data ingestion implementation which streams data out of
> Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
> process the data in each partition.  Have folks looked at ways of speeding
> up this type of ingestion?
>
> Let's say the main part of the ingest process is fetching documents from
> somewhere and performing text extraction on them. Is this type of
> processing
> best done by expressing the pipelining with Spark RDD transformations or by
> just kicking off a multi-threaded pipeline?
>
> Or, is using a multi-threaded pipeliner per partition is a decent strategy
> and the performance comes from running in a clustered mode?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to speed up data ingestion with Spark

2015-05-12 Thread dgoldenberg
Hi,

I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition.  Have folks looked at ways of speeding
up this type of ingestion?

Let's say the main part of the ingest process is fetching documents from
somewhere and performing text extraction on them. Is this type of processing
best done by expressing the pipelining with Spark RDD transformations or by
just kicking off a multi-threaded pipeline?

Or is using a multi-threaded pipeliner per partition a decent strategy,
with the performance coming from running in clustered mode?


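For what it's worth, a hedged sketch of the second variant, where each
partition runs a small thread pool for the fetch-and-extract step; the
fetch/extract logic, URLs and pool size are all illustrative placeholders:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per_partition_pipeline").getOrCreate()

def fetch_and_extract(url):
    # Placeholder for "fetch the document and run text extraction on it"
    return {"url": url, "text": "..."}

def process_partition(urls):
    # One small thread pool per partition; I/O-bound fetches overlap nicely here
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_and_extract, urls))

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
results = (spark.sparkContext
           .parallelize(urls, numSlices=2)
           .mapPartitions(process_partition)
           .collect())
print(results)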

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org