from:"sachin aggarwal \(JIRA\)"

[jira] [Commented] (SPARK-19624) --conf spark.app.name=test is not working with spark-shell/pyspark

2017-02-19 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874027#comment-15874027
 ] 

Sachin Aggarwal commented on SPARK-19624:
-

[~srowen] I agree that --name is working but we should not ignore --conf 
spark.app.name="test" 

> --conf spark.app.name=test is not working with spark-shell/pyspark
> --
>
> Key: SPARK-19624
> URL: https://issues.apache.org/jira/browse/SPARK-19624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> On starting a spark-shell or pyshark shell if we pass --conf 
> spark.app.name=test it is not working as --name "Spark shell"  takes 
> precedence over --conf 
> line refrence for spark-shell
> https://github.com/apache/spark/blob/master/bin/spark-shell#L53
> similarly line refrence for pyspark 
> https://github.com/apache/spark/blob/master/bin/pyspark#L77



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-19624) --conf spark.app.name=test is not working with spark-shell/pyspark

2017-02-16 Thread Sachin Aggarwal (JIRA)

Sachin Aggarwal created SPARK-19624:
---

 Summary: --conf spark.app.name=test is not working with 
spark-shell/pyspark
 Key: SPARK-19624
 URL: https://issues.apache.org/jira/browse/SPARK-19624
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 2.1.0, 2.0.0, 1.6.0
Reporter: Sachin Aggarwal
Priority: Minor


On starting a spark-shell or pyshark shell if we pass --conf 
spark.app.name=test it is not working as --name "Spark shell"  takes precedence 
over --conf 
line refrence for spark-shell
https://github.com/apache/spark/blob/master/bin/spark-shell#L53
similarly line refrence for pyspark 
https://github.com/apache/spark/blob/master/bin/pyspark#L77



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-15183) Adding outputMode to structure Streaming Experimental Api

2016-06-05 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal closed SPARK-15183.
---
Resolution: Duplicate

> Adding outputMode to structure Streaming Experimental Api
> -
>
> Key: SPARK-15183
> URL: https://issues.apache.org/jira/browse/SPARK-15183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Sachin Aggarwal
>Priority: Trivial
>
> while experimenting with structure streaming. I found that mode() is used for 
> non-continuous queries while outputMode() is used for continuous queries.
> ouputMode is not defined, so I have written the some raw implementation and 
> test cases just to make sure the streaming app works 
> Note:-
> /** Start a query */
>   private[sql] def startQuery(
>   name: String,
>   checkpointLocation: String,
>   df: DataFrame,
>   sink: Sink,
>   trigger: Trigger = ProcessingTime(0),
>   triggerClock: Clock = new SystemClock(),
>   outputMode: OutputMode = Append): ContinuousQuery = {
> As per me outputMode should be defined before triggerClock, the constructor 
> with  outputMode defined will be used more often then triggerClock.
> I have added triggerClock() method also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15183) Adding outputMode to structure Streaming Experimental Api

2016-05-06 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-15183:

Summary: Adding outputMode to structure Streaming Experimental Api  (was: 
adding outputMode to structure Streaming Experimental Api)

> Adding outputMode to structure Streaming Experimental Api
> -
>
> Key: SPARK-15183
> URL: https://issues.apache.org/jira/browse/SPARK-15183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Sachin Aggarwal
>Priority: Trivial
>
> while experimenting with structure streaming. I found that mode() is used for 
> non-continuous queries while outputMode() is used for continuous queries.
> ouputMode is not defined, so I have written the some raw implementation and 
> test cases just to make sure the streaming app works 
> Note:-
> /** Start a query */
>   private[sql] def startQuery(
>   name: String,
>   checkpointLocation: String,
>   df: DataFrame,
>   sink: Sink,
>   trigger: Trigger = ProcessingTime(0),
>   triggerClock: Clock = new SystemClock(),
>   outputMode: OutputMode = Append): ContinuousQuery = {
> As per me outputMode should be defined before triggerClock, the constructor 
> with  outputMode defined will be used more often then triggerClock.
> I have added triggerClock() method also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15183) adding outputMode to structure Streaming Experimental Api

2016-05-06 Thread Sachin Aggarwal (JIRA)

Sachin Aggarwal created SPARK-15183:
---

 Summary: adding outputMode to structure Streaming Experimental Api
 Key: SPARK-15183
 URL: https://issues.apache.org/jira/browse/SPARK-15183
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Streaming
Reporter: Sachin Aggarwal
Priority: Trivial


while experimenting with structure streaming. I found that mode() is used for 
non-continuous queries while outputMode() is used for continuous queries.
ouputMode is not defined, so I have written the some raw implementation and 
test cases just to make sure the streaming app works 

Note:-
/** Start a query */
  private[sql] def startQuery(
  name: String,
  checkpointLocation: String,
  df: DataFrame,
  sink: Sink,
  trigger: Trigger = ProcessingTime(0),
  triggerClock: Clock = new SystemClock(),
  outputMode: OutputMode = Append): ContinuousQuery = {
As per me outputMode should be defined before triggerClock, the constructor 
with  outputMode defined will be used more often then triggerClock.
I have added triggerClock() method also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15183) adding outputMode to structure Streaming Experimental Api

2016-05-06 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274241#comment-15274241
 ] 

Sachin Aggarwal commented on SPARK-15183:
-

please mark closed if not qualifies for jira

> adding outputMode to structure Streaming Experimental Api
> -
>
> Key: SPARK-15183
> URL: https://issues.apache.org/jira/browse/SPARK-15183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Sachin Aggarwal
>Priority: Trivial
>
> while experimenting with structure streaming. I found that mode() is used for 
> non-continuous queries while outputMode() is used for continuous queries.
> ouputMode is not defined, so I have written the some raw implementation and 
> test cases just to make sure the streaming app works 
> Note:-
> /** Start a query */
>   private[sql] def startQuery(
>   name: String,
>   checkpointLocation: String,
>   df: DataFrame,
>   sink: Sink,
>   trigger: Trigger = ProcessingTime(0),
>   triggerClock: Clock = new SystemClock(),
>   outputMode: OutputMode = Append): ContinuousQuery = {
> As per me outputMode should be defined before triggerClock, the constructor 
> with  outputMode defined will be used more often then triggerClock.
> I have added triggerClock() method also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-15183) adding outputMode to structure Streaming Experimental Api

2016-05-06 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274241#comment-15274241
 ] 

Sachin Aggarwal edited comment on SPARK-15183 at 5/6/16 3:45 PM:
-

please mark closed if not qualifies as jira


was (Author: sachin aggarwal):
please mark closed if not qualifies for jira

> adding outputMode to structure Streaming Experimental Api
> -
>
> Key: SPARK-15183
> URL: https://issues.apache.org/jira/browse/SPARK-15183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Sachin Aggarwal
>Priority: Trivial
>
> while experimenting with structure streaming. I found that mode() is used for 
> non-continuous queries while outputMode() is used for continuous queries.
> ouputMode is not defined, so I have written the some raw implementation and 
> test cases just to make sure the streaming app works 
> Note:-
> /** Start a query */
>   private[sql] def startQuery(
>   name: String,
>   checkpointLocation: String,
>   df: DataFrame,
>   sink: Sink,
>   trigger: Trigger = ProcessingTime(0),
>   triggerClock: Clock = new SystemClock(),
>   outputMode: OutputMode = Append): ContinuousQuery = {
> As per me outputMode should be defined before triggerClock, the constructor 
> with  outputMode defined will be used more often then triggerClock.
> I have added triggerClock() method also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261802#comment-15261802
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 5/2/16 11:50 AM:
--

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

!withSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when the sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

with these two graphs we can see the usefulness of JobSetCreationDelay metric, 
for the end user to analyze where the time is consumed.


was (Author: sachin aggarwal):
Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

!withSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

with these two graphs we can see the usefulness of JobSetCreationDelay metric, 
for the end user to analyze where the time is consumed.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png, withSortByKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: withSortByKey.png

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png, withSortByKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261802#comment-15261802
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 5/2/16 11:47 AM:
--

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

!withSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

with these two graphs we can see the usefulness of JobSetCreationDelay metric, 
for the end user to analyze where the time is consumed.


was (Author: sachin aggarwal):
Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

!with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

with these two graphs we can see the usefulness of JobSetCreationDelay metric, 
for the end user to analyze where the time is consumed.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png, withSortByKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: (was: with_sortbyKey.png)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: (was: 2.png)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: (was: 1.png)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Comment: was deleted

(was: !with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png, WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261802#comment-15261802
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 5/2/16 11:40 AM:
--

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

!with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

with these two graphs we can see the usefulness of JobSetCreationDelay metric, 
for the end user to analyze where the time is consumed.


was (Author: sachin aggarwal):
Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png|width=300,height=400!
!2.png|width=300,height=400!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png, WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266297#comment-15266297
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 5/2/16 11:33 AM:
--

!with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.


was (Author: sachin aggarwal):
!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png, WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266297#comment-15266297
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 5/2/16 9:50 AM:
-

!WithOutSortByKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation the 
jobSetCreationDelay is nearly equal to processingDelay

!with_sortbyKey.png|width=300,height=400!
this graph explains that when there is sortByKey operation is removed the 
jobSetCreationDelay is nearly equal to 10ms and there is a little increase in 
processingDelay.


was (Author: sachin aggarwal):
!WithOutSortByKey.png|!

!with_sortbyKey.png|!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png, WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-05-02 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: with_sortbyKey.png
WithOutSortByKey.png

!WithOutSortByKey.png|!

!with_sortbyKey.png|!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png, WithOutSortByKey.png, with_sortbyKey.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261802#comment-15261802
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 4/28/16 9:37 AM:
--

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png|width=300,height=400!
!2.png|width=300,height=400!


was (Author: sachin aggarwal):
Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png!
!2.png!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261802#comment-15261802
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 4/28/16 9:35 AM:
--

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png!
!2.png!


was (Author: sachin aggarwal):
Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png|thumbnail!
!2.png|thumbnail!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261858#comment-15261858
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 4/28/16 9:33 AM:
--

based on over use case and analysis we have found that the major contributor to 
job generate time are :
1) sortByKey function as it submits a job to cluster for sketch function (this 
is a directly proportional to the data size being processed in that batch)
2) In direct kafka fetching offset takes nearly 2 ms 
3) nearly for every two transformation we add 1ms to our job generate time 
delay , this time is consumed i executing getOrCompute method.


was (Author: sachin aggarwal):
based on over use case and analysis we have found that the major contributor to 
job generate time are :
1) sortByKey function as it submits a job to cluster for sketch function
2) In direct kafka fetching offset takes nearly 2 ms 
3) nearly for every two transformation we add 1ms to our job generate time 
delay , this time is consumed i executing getOrCompute method.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261858#comment-15261858
 ] 

Sachin Aggarwal commented on SPARK-14597:
-

based on over use case and analysis we have found that the major contributor to 
job generate time are :
1) sortByKey function as it submits a job to cluster for sketch function
2) In direct kafka fetching offset takes nearly 2 ms 
3) nearly for every two transformation we add 1ms to our job generate time 
delay , this time is consumed i executing getOrCompute method.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: 2.png
1.png

Hi [~prashant_] I have generated few metric to support the usefulness, here are 
the graphs for the same.

Image 1 shows the increase in jobSetGenerateTimeDelay with decrease in 
batchInterval 

Image 2 shows for a batch Interval of 100ms, how jobSetGenerateTimeDelay keeps 
on increasing with time.

!1.png|thumbnail!
!2.png|thumbnail!

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: (was: 2.png)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Comment: was deleted

(was: Hi [~prashant_] i have generated few metric to support the usefulness, 
here are the graphs for the same 
!1.jpg|thumbnail!
!2.jpg|thumbnail!)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: (was: 1.png)

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Attachment: 2.png
1.png

adding analysis graph

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-28 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261797#comment-15261797
 ] 

Sachin Aggarwal edited comment on SPARK-14597 at 4/28/16 8:59 AM:
--

Hi [~prashant_] i have generated few metric to support the usefulness, here are 
the graphs for the same 
!1.jpg|thumbnail!
!2.jpg|thumbnail!


was (Author: sachin aggarwal):
adding analysis graph

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-20 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250661#comment-15250661
 ] 

Sachin Aggarwal commented on SPARK-14597:
-

hi Mario,

I have extended the approach2 added a new parameter to job class to capture the 
job creation time delay, once the job gets created, I set time taken to create 
each job and user can get this information in StreamingListener methods 
onOutputOperationStarted and onOutputOperationCompleted corresponding to each 
job, 

for batch level data user can use 
batchCompleted.batchInfo.batchJobSetCreationDelay in onBatchCompleted method of 
StreamingListener




> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-13 Thread Sachin Aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239122#comment-15239122
 ] 

Sachin Aggarwal commented on SPARK-14597:
-

I can see 2 approaches to providing the above additional metrics. I have 
attached a PR for each approach. This is to review either approach and we can 
pick the better one
   
1) add new events to listener bus:-
add more event case classes 
 case class StreamingListenerBatchGenerateStarted(time: Time) extends 
StreamingListenerEvent
 case class StreamingListenerBatchGenerateCompleted(time: Time) extends 
StreamingListenerEvent
 case class StreamingListenerCheckpointingStarted(time: Time) extends 
StreamingListenerEvent
 case class StreamingListenerCheckpointingCompleted(time: Time) extends 
StreamingListenerEvent

and new functions to the listener interface for receiving information
 def onBatchGenerateStarted(batchGenerateStarted: 
StreamingListenerBatchGenerateStarted) { }
 def onBatchGenerateCompleted(batchGenerateCompleted: 
StreamingListenerBatchGenerateCompleted) { }
 def onCheckpointingStarted(checkpointingStarted: 
StreamingListenerCheckpointingStarted) { }
 def onCheckpointingCompleted(checkpointingCompleted: 
StreamingListenerCheckpointingCompleted) { }

2) add one parameter to JobSet and pass it to BatchInfo Class to track JobSet 
CreationDelay for each batch. As jobSet CreationDelay is related to a batch, 
this can be a part of BatchInfo. For checkpointing we can use listener approach 
as checkpointing  as checkpointing is triggered at checkpoint interval not 
manadaterally at batch interval.

case class JobSet(
   time: Time,
   jobSetCreationDelay: Option[Long],
   jobs: Seq[Job],
   streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {

   case class BatchInfo(
   batchTime: Time,
   creationDelay: Option[Long],
   streamIdToInputInfo: Map[Int, StreamInputInfo],
   submissionTime: Long,
   processingStartTime: Option[Long],
   processingEndTime: Option[Long],
   outputOperationInfos: Map[Int, OutputOperationInfo]
 ) {
} 

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-13 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Description: 
While looking to tune our streaming application, the piece of info we were 
looking for was actual processing time per batch. The 
StreamingListener.onBatchCompleted event provides a BatchInfo object that 
provided this information. It provides the following data
 - processingDelay
 - schedulingDelay
 - totalDelay
 - Submission Time
 The above are essentially calculated from the streaming JobScheduler clocking 
the processingStartTime and processingEndTime for each JobSet. Another metric 
available is submissionTime which is when a Jobset was put on the Streaming 
Scheduler's Queue. 
 
So we took processing delay as our actual processing time per batch. However to 
maintain a stable streaming application, we found that the our batch interval 
had to be a little less than DOUBLE of the processingDelay metric reported. (We 
are using a DirectKafkaInputStream). On digging further, we found that 
processingDelay is only clocking time spent in the ForEachRDD closure of the 
Streaming application and that JobGenerator's graph.generateJobs 
(https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
 method takes a significant more amount of time.

 Thus a true reflection of processing time is
 a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
 b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
 c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
metric)
 d - Time spent in Jobset's job run (existing processingDelay metric)
 
 Additionally a JobGeneratorQueue delay (#a) could be due to either 
graph.generateJobs taking longer than batchInterval or other JobGenerator 
events like checkpointing adding up time. Thus it would be beneficial to report 
time taken by the checkpointing Job as well

  was:
While looking to tune our streaming application, the piece of info we were 
looking for was actual processing time per batch. The 
StreamingListener.onBatchCompleted event provides a BatchInfo object that 
provided this information. It provides the following data
 - processingDelay
 - schedulingDelay
 - totalDelay
 - Submission Time
 The above are essentially calculated from the streaming JobScheduler clocking 
the processingStartTime and processingEndTime for each JobSet. Another metric 
available is submissionTime which is when a Jobset was put on the Streaming 
Scheduler's Queue. 
 
So we took processing delay as our actual processing time per batch. However to 
maintain a stable streaming application, we found that the our batch interval 
had to be a little less than DOUBLE of the processingDelay metric reported. (We 
are using a DirectKafkaInputStream). On digging further, we found that 
processingDelay is only clocking time spent in the ForEachRDD closure of the 
Streaming application and that JobGenerator's graph.generateJobs 
(https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
 method takes a significant more amount of time.

 Thus a true reflection of processing time is
 a - Time spent in JobGenerator's Job Queue (jobGenerator scheduling delay or 
JobGeneratorQueue delay)
 b - Time spent in JobGenerator's graph.generateJobs (JobSetCreation delay)
 c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
metric)
 d - Time spent in Jobset's job run (existing processingDelay metric)
 
 Additionally a JobGeneratorQueue delay (#a) could be due to either 
graph.generateJobs taking longer than batchInterval or other JobGenerator 
events like checkpointing adding up time. Thus it would be beneficial to report 
time taken by the checkpointing Job as well


> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
>

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-13 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Description: 
While looking to tune our streaming application, the piece of info we were 
looking for was actual processing time per batch. The 
StreamingListener.onBatchCompleted event provides a BatchInfo object that 
provided this information. It provides the following data
 - processingDelay
 - schedulingDelay
 - totalDelay
 - Submission Time
 The above are essentially calculated from the streaming JobScheduler clocking 
the processingStartTime and processingEndTime for each JobSet. Another metric 
available is submissionTime which is when a Jobset was put on the Streaming 
Scheduler's Queue. 
 
So we took processing delay as our actual processing time per batch. However to 
maintain a stable streaming application, we found that the our batch interval 
had to be a little less than DOUBLE of the processingDelay metric reported. (We 
are using a DirectKafkaInputStream). On digging further, we found that 
processingDelay is only clocking time spent in the ForEachRDD closure of the 
Streaming application and that JobGenerator's graph.generateJobs 
(https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
 method takes a significant more amount of time.

 Thus a true reflection of processing time is
 a - Time spent in JobGenerator's Job Queue (jobGenerator scheduling delay or 
JobGeneratorQueue delay)
 b - Time spent in JobGenerator's graph.generateJobs (JobSetCreation delay)
 c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
metric)
 d - Time spent in Jobset's job run (existing processingDelay metric)
 
 Additionally a JobGeneratorQueue delay (#a) could be due to either 
graph.generateJobs taking longer than batchInterval or other JobGenerator 
events like checkpointing adding up time. Thus it would be beneficial to report 
time taken by the checkpointing Job as well

  was:
While looking to tune our streaming application, the piece of info we were 
looking for was actual processing time per batch. The 
StreamingListener.onBatchCompleted event provides a BatchInfo object that 
provided this information. It provides the following data
 - processingDelay
 - schedulingDelay
 - totalDelay
 - Submission Time
 The above are essentially calculated from the streaming JobScheduler clocking 
the processingStartTime and processingEndTime for each JobSet. Another metric 
available is submissionTime which is when a Jobset was put on the Streaming 
Scheduler's Queue. 
 
So we took processing delay as our actual processing time per batch. However to 
maintain a stable streaming application, we found that the our batch interval 
had to be a little less than DOUBLE of the processingDelay metric reported. (We 
are using a DirectKafkaInputStream). On digging further, we found that 
processingDelay is only clocking time spent in the ForEachRDD closure of the 
Streaming application and that JobGenerator's graph.generateJobs 
(https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
 method takes a significant more amount of time.

 Thus a true reflection of processing time is
 a - Time spent in JobGenerator's Job Queue (jobGenerator scheduling delay or 
JobGeneratorQueue delay)
 b - Time spent in JobGenerator's graph.generateJobs 
(generateJobProcessingDelay)
 c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
metric)
 d - Time spent in Jobset's job run (existing processingDelay metric)
 
 Additionally a JobGeneratorQueue delay (#a) could be due to either 
graph.generateJobs taking longer than batchInterval or other JobGenerator 
events like checkpointing adding up time. Thus it would be beneficial to report 
time taken by the checkpointing Job as well


> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTim

[jira] [Updated] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-13 Thread Sachin Aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal updated SPARK-14597:

Component/s: Spark Core

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of info we were 
> looking for was actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> provided this information. It provides the following data
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime which is when a Jobset was put on 
> the Streaming Scheduler's Queue. 
>  
> So we took processing delay as our actual processing time per batch. However 
> to maintain a stable streaming application, we found that the our batch 
> interval had to be a little less than DOUBLE of the processingDelay metric 
> reported. (We are using a DirectKafkaInputStream). On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes a significant more amount of time.
>  Thus a true reflection of processing time is
>  a - Time spent in JobGenerator's Job Queue (jobGenerator scheduling delay or 
> JobGeneratorQueue delay)
>  b - Time spent in JobGenerator's graph.generateJobs 
> (generateJobProcessingDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally a JobGeneratorQueue delay (#a) could be due to either 
> graph.generateJobs taking longer than batchInterval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-13 Thread Sachin Aggarwal (JIRA)

Sachin Aggarwal created SPARK-14597:
---

 Summary: Streaming Listener timing metrics should include time 
spent in JobGenerator's graph.generateJobs
 Key: SPARK-14597
 URL: https://issues.apache.org/jira/browse/SPARK-14597
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.1, 2.0.0
Reporter: Sachin Aggarwal
Priority: Minor


While looking to tune our streaming application, the piece of info we were 
looking for was actual processing time per batch. The 
StreamingListener.onBatchCompleted event provides a BatchInfo object that 
provided this information. It provides the following data
 - processingDelay
 - schedulingDelay
 - totalDelay
 - Submission Time
 The above are essentially calculated from the streaming JobScheduler clocking 
the processingStartTime and processingEndTime for each JobSet. Another metric 
available is submissionTime which is when a Jobset was put on the Streaming 
Scheduler's Queue. 
 
So we took processing delay as our actual processing time per batch. However to 
maintain a stable streaming application, we found that the our batch interval 
had to be a little less than DOUBLE of the processingDelay metric reported. (We 
are using a DirectKafkaInputStream). On digging further, we found that 
processingDelay is only clocking time spent in the ForEachRDD closure of the 
Streaming application and that JobGenerator's graph.generateJobs 
(https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
 method takes a significant more amount of time.

 Thus a true reflection of processing time is
 a - Time spent in JobGenerator's Job Queue (jobGenerator scheduling delay or 
JobGeneratorQueue delay)
 b - Time spent in JobGenerator's graph.generateJobs 
(generateJobProcessingDelay)
 c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
metric)
 d - Time spent in Jobset's job run (existing processingDelay metric)
 
 Additionally a JobGeneratorQueue delay (#a) could be due to either 
graph.generateJobs taking longer than batchInterval or other JobGenerator 
events like checkpointing adding up time. Thus it would be beneficial to report 
time taken by the checkpointing Job as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-09 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138498#comment-15138498
 ] 

sachin aggarwal commented on SPARK-13172:
-

Now the question is where should I add this function so it can be leveraged 
across modules?

> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138423#comment-15138423
 ] 

sachin aggarwal edited comment on SPARK-13172 at 2/9/16 6:49 AM:
-

There are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
{code}
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}
{code}

2) println(t.getStackTrace.mkString("\n"))

mkstring approach give extractly same string as old function getStackTraceString
but the output of first approach is more readable.
h3.Example

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}



was (Author: sachin aggarwal):
there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
{code}
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}
{code}

2) println(t.getStackTrace.mkString("\n"))

mkstring approach give extractly same string as old function getStackTraceString
but the output of first approach is more readable.
h3.Example

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}


> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138423#comment-15138423
 ] 

sachin aggarwal edited comment on SPARK-13172 at 2/9/16 6:35 AM:
-

there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
{code}
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}
{code}

2) println(t.getStackTrace.mkString("\n"))

mkstring approach give extractly same string as old function getStackTraceString
but the output of first approach is more readable.
h3.Example

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}



was (Author: sachin aggarwal):
there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
{code}
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}
{code}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}


> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138423#comment-15138423
 ] 

sachin aggarwal edited comment on SPARK-13172 at 2/9/16 6:33 AM:
-

there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}



was (Author: sachin aggarwal):
there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

```
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
```


> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138423#comment-15138423
 ] 

sachin aggarwal edited comment on SPARK-13172 at 2/9/16 6:33 AM:
-

there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
{code}
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}
{code}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}



was (Author: sachin aggarwal):
there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

{code:title=TrySuccessFailure.scala|borderStyle=solid}
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
{code}


> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138423#comment-15138423
 ] 

sachin aggarwal commented on SPARK-13172:
-

there are two ways we can proceed :- first use printStackTrace and second to 
use mkString()

1) can be ecapsulated in function  :-
def getStackTraceAsString(t: Throwable) = {
val sw = new StringWriter
t.printStackTrace(new PrintWriter(sw))
sw.toString
}

2) println(t.getStackTrace.mkString("\n"))


mkstring approach give extractly same string as old function getStackTraceString

but the output of first approach is more readable.

try this code to see the difference 

```
import scala.util.{Try, Success, Failure}
import java.io._

object TrySuccessFailure extends App {

  badAdder(3) match {
case Success(i) => println(s"success, i = $i")
case Failure(t) =>
  // this works, but it's not too useful/readable
  println(t.getStackTrace.mkString("\n"))
  println("===")
  println(t.getStackTraceString)
  // this works much better
  val sw = new StringWriter
  t.printStackTrace(new PrintWriter(sw))
  println(sw.toString)
  }
  def badAdder(a: Int): Try[Int] = {
Try({
  val b = a + 1
  if (b == 3) b else {
val ioe = new IOException("Boom!")
throw new AlsException("Bummer!", ioe)
  }
})
  }
  class AlsException(s: String, e: Exception) extends Exception(s: String, e: 
Exception)
}
```


> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-08 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137136#comment-15137136
 ] 

sachin aggarwal commented on SPARK-13172:
-

instead of getStackTraceString should I  use e.getStackTrace or 
e.printStackTrace

> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-05 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133963#comment-15133963
 ] 

sachin aggarwal commented on SPARK-13172:
-

as scala also recommends the same 
http://www.scala-lang.org/api/2.11.1/index.html#scala.runtime.RichException

it should be change , I will go ahead and make the change

> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.

2016-02-05 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954
 ] 

sachin aggarwal edited comment on SPARK-13177 at 2/5/16 10:23 AM:
--

Hi [~holdenk]

I will like to work on this, 

thanks
Sachin


was (Author: sachin aggarwal):
Hi [~AlHolden],

I will like to work on this, 

thanks
Sachin

> Update ActorWordCount example to not directly use low level linked list as it 
> is deprecated.
> 
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: holdenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.

2016-02-05 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954
 ] 

sachin aggarwal commented on SPARK-13177:
-

Hi [~AlHolden],

I will like to work on this, 

thanks
Sachin

> Update ActorWordCount example to not directly use low level linked list as it 
> is deprecated.
> 
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: holdenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129802#comment-15129802
 ] 

sachin aggarwal commented on SPARK-13065:
-

happy to see, thats exactly what I have added have a look at this file to see 
how to use new API for java use case :-
https://github.com/agsachin/spark/blob/SPARK-13065/external/twitter/src/test/java/org/apache/spark/streaming/twitter/JavaTwitterStreamSuite.java
 
and for scala check this out 
https://github.com/agsachin/spark/blob/SPARK-13065/external/twitter/src/test/scala/org/apache/spark/streaming/twitter/TwitterStreamSuite.scala

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it 
> how ever you like. In my streaming spark app main() I have the following code
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-02 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129787#comment-15129787
 ] 

sachin aggarwal commented on SPARK-13069:
-

As of Spark 2.0 (not yet released), Spark does not use Akka any more. 

See https://issues.apache.org/jira/browse/SPARK-5293

can you check with latest 2.0 build, to see if similar problem exists.


> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate an actor receiver sends data to spark is not limited by maxRate or 
> back pressure. Spark would control how fast it writes the data to block 
> manager, but the receiver actor sends events asynchronously and would fill 
> out akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127935#comment-15127935
 ] 

sachin aggarwal commented on SPARK-13065:
-

[~aedwip] 
I got a doubt after reading ur last comment you mentioned FilterQuery in 
description and here you are addressing twitter4j.query, please clarify.  

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127925#comment-15127925
 ] 

sachin aggarwal commented on SPARK-13065:
-

List of changes:
1) Added support for passing FilterQuery object instead of just Seq of keywords
2) Java had more flexible Api syntax than that of Scala so added similar Api 
syntax for Scala also 
3) added test cases for the all the new Api's 


> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-01-31 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125834#comment-15125834
 ] 

sachin aggarwal commented on SPARK-13065:
-

I would like to work on this , will issue a pull request soon..

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039767#comment-15039767
 ] 

sachin aggarwal commented on SPARK-12117:
-

Hi,

can u suggest me some  work around to make this use case work in 1.5.1 ?

thanks 

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1)create a file copy following text--filename(a.json)
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2)define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3)register the udf 
>  sqlContext.udf.register("mydef" ,mydef _)
> 4)read the input file 
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5)make a call to UDF
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)

[jira] [Updated] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-02 Thread sachin aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin aggarwal updated SPARK-12117:

Description: 
case where this works:
  val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
  
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce error case:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal","id": "1"}
{   "myText": "Rishabh","id": "2"}

2)define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3)register the udf 
 sqlContext.udf.register("mydef" ,mydef _)

4)read the input file 
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5)make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at 
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)

  was:
case where this works:
  val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
  
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal",

[jira] [Updated] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-02 Thread sachin aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin aggarwal updated SPARK-12117:

Description: 
case where this works:
  val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
  
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal","id": "1"}
{   "myText": "Rishabh","id": "2"}

2)define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3)register the udf 
 sqlContext.udf.register("mydef" ,mydef _)

4)read the input file 
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5)make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).explain(true)

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at 
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)

  was:
case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")

TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal",

[jira] [Updated] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-02 Thread sachin aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin aggarwal updated SPARK-12117:

Description: 
case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")

TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal","id": "1"}
{   "myText": "Rishabh","id": "2"}

2)define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3)register the udf 
 sqlContext.udf.register("mydef" ,mydef _)

4)read the input file 
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5)make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).explain(true)

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at 
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)

  was:
case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal","i

[jira] [Updated] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-02 Thread sachin aggarwal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin aggarwal updated SPARK-12117:

Description: 
case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Sachin Aggarwal","id": "1"}
{   "myText": "Rishabh","id": "2"}

2)define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3)register the udf 
 sqlContext.udf.register("mydef" ,mydef _)

4)read the input file 
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5)make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).explain(true)

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at 
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)

  was:
case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Mauricio A. Hernandez",  "id

[jira] [Created] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-02 Thread sachin aggarwal (JIRA)

sachin aggarwal created SPARK-12117:
---

 Summary: Column Aliases are Ignored in callUDF while using struct()
 Key: SPARK-12117
 URL: https://issues.apache.org/jira/browse/SPARK-12117
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: sachin aggarwal


case where this works:
val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
("Rishabh", "2"))).toDF("myText", "id")
TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show


steps to reproduce:
1)create a file copy following text--filename(a.json)

{   "myText": "Mauricio A. Hernandez",  "id": "1"}
{   "myText": "Popa, Lucian",   "id": "2"}

2)define a simple UDF
def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}

3)register the udf 
 sqlContext.udf.register("mydef" ,mydef _)

4)read the input file 
val TestDoc2=sqlContext.read.json("/tmp/a.json")

5)make a call to UDF
TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).explain(true)

ERROR received:
java.lang.IllegalArgumentException: Field "Text" does not exist.
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
 at 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
 at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
 at 
$line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
$line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
 at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
 at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.lang.Thread.run(Thread.java:857)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues

[jira] [Commented] (SPARK-11552) Replace example code in ml-decision-tree.md using include_example

2015-11-05 Thread sachin aggarwal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993256#comment-14993256
 ] 

sachin aggarwal commented on SPARK-11552:
-

I will start with this

> Replace example code in ml-decision-tree.md using include_example
> -
>
> Key: SPARK-11552
> URL: https://issues.apache.org/jira/browse/SPARK-11552
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

56 matches

Mail list logo