JSON DataFrame Formation and Querying

2015-07-01 Thread Chaudhary, Umesh
Hi,
I am creating a DataFrame from a JSON file, and the schema of the JSON, as 
depicted by dataframe.printSchema(), is:

root
 |-- 1-F2: struct (nullable = true)
 |    |-- A: string (nullable = true)
 |    |-- B: string (nullable = true)
 |    |-- C: string (nullable = true)
 |-- 10-C4: struct (nullable = true)
 |    |-- A: string (nullable = true)
 |    |-- D: string (nullable = true)
 |    |-- E: string (nullable = true)
 |-- 11-B5: struct (nullable = true)
 |    |-- A: string (nullable = true)
 |    |-- D: string (nullable = true)
 |    |-- F: string (nullable = true)
 |    |-- G: string (nullable = true)

In the above schema, the struct-type elements {1-F2, 10-C4, 11-B5} are dynamic. 
This kind of dynamic schema is easy to parse with any JSON parser (e.g. Gson, 
Jackson), and a Map-type structure makes it easy to query and transform, but in 
Spark 1.4 how should I query it using a construct like:

dataframe.select([0]).show()  --> index-based query

I tried to save it as a table and then describe it from the spark-sql REPL, but 
it is unable to find my table.

What is the preferred way to deal with this type of use case in Spark?
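The Map-style handling mentioned above can be illustrated without Spark. A minimal sketch using Python's standard json module (the keys and values below are invented to mirror the schema above):

```python
import json

# Sample document mirroring the schema above; the top-level struct
# names ("1-F2", "10-C4", ...) are dynamic and not known in advance.
doc = json.loads("""
{
  "1-F2":  {"A": "a1", "B": "b1", "C": "c1"},
  "10-C4": {"A": "a2", "D": "d2", "E": "e2"}
}
""")

# With a Map-type structure the dynamic keys can be enumerated and
# addressed by position, i.e. the index-based query from the question.
keys = sorted(doc.keys())
first_struct = doc[keys[0]]   # struct at index 0
print(keys)                   # -> ['1-F2', '10-C4']
print(first_struct["A"])      # -> a1
```

In PySpark the analogous positional access would be along the lines of `df.select(df.columns[0])`, since `df.columns` exposes the (dynamic) top-level field names in schema order; the exact API surface in 1.4 should be checked against the docs.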

Regards,
Umesh Chaudhary

This message, including any attachments, is the property of Sears Holdings 
Corporation and/or one of its subsidiaries. It is confidential and may contain 
proprietary or legally privileged information. If you are not the intended 
recipient, please delete it without reading the contents. Thank you.


RE: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Chaudhary, Umesh
Thanks Akhil for taking up this point; I am also asking about the MQ bottleneck.
I currently have 5 receivers based on an unreliable Websphere MQ receiver 
implementation.
Is there any proven way to convert this implementation into a reliable one?
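For the reliability question: the usual pattern is to receive messages under an MQ sync-point and acknowledge them only after a blocking store has succeeded. A Spark-free sketch of that store-then-ack pattern (StubQueue and reliable_receive are hypothetical stand-ins, not Spark or MQ API):

```python
# Messages are acknowledged to the broker only AFTER they have been
# durably stored; a crash before commit leaves them on the queue.
class StubQueue:
    def __init__(self, messages):
        self._messages = list(messages)
        self._unacked = []

    def get(self):                  # receive under sync-point (no auto-ack)
        msg = self._messages.pop(0)
        self._unacked.append(msg)
        return msg

    def commit(self):               # ack: broker may now discard messages
        self._unacked.clear()

    def empty(self):
        return not self._messages

def reliable_receive(queue, store, batch_size=2):
    """Store a whole batch, then commit the sync-point."""
    while not queue.empty():
        batch = []
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get())
        store(batch)    # blocking store; raises on failure, so no commit
        queue.commit()  # only ack once the data is safely stored

stored = []
q = StubQueue(["m1", "m2", "m3"])
reliable_receive(q, stored.extend)
print(stored)           # -> ['m1', 'm2', 'm3']
```

In an actual custom receiver this corresponds to preferring the multi-record store(...) variants, which block until Spark has replicated the data, and committing the MQ sync-point only afterwards; the single-record store() returns before the data is safe, which is what makes a receiver unreliable.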


Regards,
Umesh Chaudhary
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, June 16, 2015 12:44 PM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Optimizing Streaming from Websphere MQ

Each receiver will run on 1 core. So if your network is not the bottleneck, 
then to test the consumption speed of the receivers you can simply do a 
dstream.count.print to see how many records it can receive (this is also 
available in the Streaming tab of the driver UI). If you spawn 10 receivers on 
10 cores, then possibly no processing will happen other than receiving.
On the other hand, the MQ can also be the bottleneck (you could possibly 
configure it to achieve more parallelism).

Thanks
Best Regards

On Mon, Jun 15, 2015 at 2:40 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
Hi Akhil,
Thanks for your response.
I have 10 cores in total across my 3 machines, and I have 5-10 receivers.
I have tried to test the number of records processed per second while varying 
the number of receivers.
With 10 receivers (i.e. one receiver per core), I do not see any performance 
benefit.
Is this related to a bottleneck in MQ or in the reliable receiver?

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Saturday, June 13, 2015 1:10 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Optimizing Streaming from Websphere MQ

How many cores are you allocating for your job? And how many receivers are you 
having? It would be good if you can post your custom receiver code, it will 
help people to understand it better and shed some light.

Thanks
Best Regards

On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
Hi,
I have created a custom receiver in Java which receives data from Websphere MQ, 
and I am only writing the received records to HDFS.

I have referred to many forums for optimizing the speed of a Spark Streaming 
application. Here I am listing a few:

• Spark Official <http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning>
• Virdata <http://www.virdata.com/tuning-spark/>
• TD's slides (a bit old but useful) <http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617>

I mainly took these points for my use case:

• giving a batch interval of 1 sec
• setting "spark.streaming.blockInterval" = 200ms
• inputStream.repartition(3)
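Wired together, those knobs might look like the following configuration sketch (hypothetical pyspark-style code, not from this thread; the accepted format of the blockInterval value varies by Spark version):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("mq-ingest")  # hypothetical application name
        .set("spark.streaming.blockInterval", "200ms"))  # cut blocks every 200 ms
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=1)  # 1-second batch interval
# inputStream = <custom MQ receiver stream>
# inputStream.repartition(3)  # spread the received blocks across cores
```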

But that did not improve the actual throughput of my receiver, which is at most 
5-10 records/sec. This is far below my expectation.
Am I missing something?

Regards,
Umesh Chaudhary


RE: Optimizing Streaming from Websphere MQ

2015-06-15 Thread Chaudhary, Umesh
Hi Akhil,
Thanks for your response.
I have 10 cores in total across my 3 machines, and I have 5-10 receivers.
I have tried to test the number of records processed per second while varying 
the number of receivers.
With 10 receivers (i.e. one receiver per core), I do not see any performance 
benefit.
Is this related to a bottleneck in MQ or in the reliable receiver?

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Saturday, June 13, 2015 1:10 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Optimizing Streaming from Websphere MQ

How many cores are you allocating for your job? And how many receivers are you 
having? It would be good if you can post your custom receiver code, it will 
help people to understand it better and shed some light.

Thanks
Best Regards

On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
Hi,
I have created a custom receiver in Java which receives data from Websphere MQ, 
and I am only writing the received records to HDFS.

I have referred to many forums for optimizing the speed of a Spark Streaming 
application. Here I am listing a few:

• Spark Official <http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning>
• Virdata <http://www.virdata.com/tuning-spark/>
• TD's slides (a bit old but useful) <http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617>

I mainly took these points for my use case:

• giving a batch interval of 1 sec
• setting "spark.streaming.blockInterval" = 200ms
• inputStream.repartition(3)

But that did not improve the actual throughput of my receiver, which is at most 
5-10 records/sec. This is far below my expectation.
Am I missing something?

Regards,
Umesh Chaudhary


Optimizing Streaming from Websphere MQ

2015-06-12 Thread Chaudhary, Umesh
Hi,
I have created a custom receiver in Java which receives data from Websphere MQ, 
and I am only writing the received records to HDFS.

I have referred to many forums for optimizing the speed of a Spark Streaming 
application. Here I am listing a few:

* Spark Official <http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning>
* Virdata <http://www.virdata.com/tuning-spark/>
* TD's slides (a bit old but useful) <http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617>

I mainly took these points for my use case:

* giving a batch interval of 1 sec
* setting "spark.streaming.blockInterval" = 200ms
* inputStream.repartition(3)

But that did not improve the actual throughput of my receiver, which is at most 
5-10 records/sec. This is far below my expectation.
Am I missing something?

Regards,
Umesh Chaudhary



RE: FW: Websphere MQ as a data source for Apache Spark Streaming

2015-06-01 Thread Chaudhary, Umesh
Thanks for your suggestion.
Yes, by DStream.saveAsTextFiles().
I was making a mistake by writing StorageLevel.NULL while overriding the 
storageLevel method in my custom receiver.
When I changed it to StorageLevel.MEMORY_AND_DISK_2(), data started being saved 
to disk.
Now it's running without any issue.


From: Tathagata Das [mailto:t...@databricks.com]
Sent: Friday, May 29, 2015 3:30 AM
To: Chaudhary, Umesh
Cc: Arush Kharbanda; user@spark.apache.org
Subject: Re: FW: Websphere MQ as a data source for Apache Spark Streaming

Are you sure that the data can be saved as strings?
Another, more controlled approach is to use DStream.foreachRDD, which takes a 
Function2 over an RDD and a Time. There you can explicitly work with the RDD 
and save it to separate files (separated by time), or whatever. That might help 
you debug what is going on.
It might also help if you show the streaming program in a pastebin.

TD
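Spark aside, the foreachRDD advice reduces to writing each batch under its own time-derived name. A plain-Python sketch of that per-batch layout (the directory and naming scheme are invented for illustration):

```python
import os
import tempfile

def save_batch(records, out_dir, batch_time_ms):
    """Write one batch to its own file named by batch time -- the same
    per-interval layout that foreachRDD / saveAsTextFiles produce."""
    path = os.path.join(out_dir, "out-%d.txt" % batch_time_ms)
    with open(path, "w") as f:
        for rec in records:
            f.write("%s\n" % rec)
    return path

out_dir = tempfile.mkdtemp()
path = save_batch(["rec1", "rec2"], out_dir, 1432886400000)
with open(path) as f:
    print(f.read())   # -> rec1 and rec2, one per line
```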


On Fri, May 29, 2015 at 12:55 AM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
Hi,
I have written a manual receiver for Websphere MQ and it is working fine.
If I do JavaDStream.saveAsTextFiles("/home/user/out.txt"), it generates a 
directory named out.txt with a timestamp appended.
In this directory only a _SUCCESS file is present. I can see data on the 
console while running in local mode, but I am not able to save it as a text 
file.
Is there any other way to save streaming data?

From: Chaudhary, Umesh
Sent: Tuesday, May 26, 2015 2:39 AM
To: 'Arush Kharbanda'; user@spark.apache.org
Subject: RE: Websphere MQ as a data source for Apache Spark Streaming

Thanks for the suggestion, I will try and post the outcome.

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can write a custom receiver for Websphere MQ using the Websphere MQ API.

https://spark.apache.org/docs/latest/streaming-custom-receivers.html

Thanks
Arush

On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
I have seen it, but it has a different configuration for connecting to MQ.
For Websphere MQ we need Host, Queue Manager, Channel and Queue Name, but here, 
per the MQTT protocol:

client = new MqttClient(brokerUrl, MqttClient.generateClientId(), persistence)

it only expects a broker URL, which is not sufficient for establishing a 
connection with Websphere MQ.

Please suggest!
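The mismatch is visible from the connection parameters each transport needs; a small comparison sketch (all values are placeholders, not from this thread):

```python
# MQTT needs only a single broker URL...
mqtt_conn = {"brokerUrl": "tcp://broker.example.com:1883"}

# ...while the Websphere MQ client API needs the full set of
# queue-manager coordinates (all values hypothetical):
wmq_conn = {
    "host": "mq.example.com",
    "port": 1414,
    "queueManager": "QM1",
    "channel": "DEV.APP.SVRCONN",
    "queueName": "DEV.QUEUE.1",
}

# Parameters that a bare broker URL cannot supply:
missing = sorted(set(wmq_conn) - set(mqtt_conn))
print(missing)  # -> ['channel', 'host', 'port', 'queueManager', 'queueName']
```

IBM's classic Java client (e.g. MQEnvironment and MQQueueManager) is driven by exactly these host/channel/queue-manager coordinates, which is why a custom receiver built on that API, rather than the MQTT utilities, fits Websphere MQ.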


From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 6:29 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can connect Spark Streaming with MQTT; refer to the example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 <umesh.chaudh...@searshc.com> wrote:
I was digging into the possibilities for Websphere MQ as a data source for 
Spark Streaming because it is needed in one of our use cases. I learned that 
MQTT <http://mqtt.org/> is the protocol that supports communication with MQ 
data structures, but since I am a newbie to Spark Streaming I need some 
working examples. Has anyone tried to connect MQ with Spark Streaming? Please 
suggest the best way of doing so.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



--

Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com




FW: Websphere MQ as a data source for Apache Spark Streaming

2015-05-29 Thread Chaudhary, Umesh
Hi,
I have written a manual receiver for Websphere MQ and it is working fine.
If I do JavaDStream.saveAsTextFiles("/home/user/out.txt"), it generates a 
directory named out.txt with a timestamp appended.
In this directory only a _SUCCESS file is present. I can see data on the 
console while running in local mode, but I am not able to save it as a text 
file.
Is there any other way to save streaming data?

From: Chaudhary, Umesh
Sent: Tuesday, May 26, 2015 2:39 AM
To: 'Arush Kharbanda'; user@spark.apache.org
Subject: RE: Websphere MQ as a data source for Apache Spark Streaming

Thanks for the suggestion, I will try and post the outcome.

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can write a custom receiver for Websphere MQ using the Websphere MQ API.

https://spark.apache.org/docs/latest/streaming-custom-receivers.html

Thanks
Arush

On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
I have seen it, but it has a different configuration for connecting to MQ.
For Websphere MQ we need Host, Queue Manager, Channel and Queue Name, but here, 
per the MQTT protocol:

client = new MqttClient(brokerUrl, MqttClient.generateClientId(), persistence)

it only expects a broker URL, which is not sufficient for establishing a 
connection with Websphere MQ.

Please suggest!


From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 6:29 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can connect Spark Streaming with MQTT; refer to the example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 <umesh.chaudh...@searshc.com> wrote:
I was digging into the possibilities for Websphere MQ as a data source for 
Spark Streaming because it is needed in one of our use cases. I learned that 
MQTT <http://mqtt.org/> is the protocol that supports communication with MQ 
data structures, but since I am a newbie to Spark Streaming I need some 
working examples. Has anyone tried to connect MQ with Spark Streaming? Please 
suggest the best way of doing so.











RE: Websphere MQ as a data source for Apache Spark Streaming

2015-05-26 Thread Chaudhary, Umesh
Thanks for the suggestion, I will try and post the outcome.

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can write a custom receiver for Websphere MQ using the Websphere MQ API.

https://spark.apache.org/docs/latest/streaming-custom-receivers.html

Thanks
Arush

On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
I have seen it, but it has a different configuration for connecting to MQ.
For Websphere MQ we need Host, Queue Manager, Channel and Queue Name, but here, 
per the MQTT protocol:

client = new MqttClient(brokerUrl, MqttClient.generateClientId(), persistence)

it only expects a broker URL, which is not sufficient for establishing a 
connection with Websphere MQ.

Please suggest!


From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 6:29 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can connect Spark Streaming with MQTT; refer to the example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 <umesh.chaudh...@searshc.com> wrote:
I was digging into the possibilities for Websphere MQ as a data source for 
Spark Streaming because it is needed in one of our use cases. I learned that 
MQTT <http://mqtt.org/> is the protocol that supports communication with MQ 
data structures, but since I am a newbie to Spark Streaming I need some 
working examples. Has anyone tried to connect MQ with Spark Streaming? Please 
suggest the best way of doing so.








