[jira] [Updated] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemant Bhanawat updated SPARK-24144:

Priority: Major  (was: Minor)

> monotonically_increasing_id on streaming dataFrames
> ---
>
> Key: SPARK-24144
> URL: https://issues.apache.org/jira/browse/SPARK-24144
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Hemant Bhanawat
>Priority: Major
>
> For our use case, we want to assign snapshot ids (incrementing counters) to 
> the incoming records. In case of a failure, a record should get the same id 
> it had before the failure, so that the downstream DB can handle the records 
> correctly. 
> We were trying to do this by zipping the streaming RDDs with that counter 
> using a modified version of ZippedWithIndexRDD. There are other ways to do 
> this, but all of them turn out to be cumbersome and error-prone in failure 
> scenarios.
> As suggested on the Spark user/dev list, one way to do this would be to 
> support monotonically_increasing_id on streaming DataFrames in the Spark 
> code base. This would ensure that counters keep incrementing for the records 
> of the stream. Also, since the counter can be checkpointed, it would work 
> well in failure scenarios. Last but not least, doing this in Spark would be 
> the most performant way.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-24144:
---

 Summary: monotonically_increasing_id on streaming dataFrames
 Key: SPARK-24144
 URL: https://issues.apache.org/jira/browse/SPARK-24144
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Hemant Bhanawat


For our use case, we want to assign snapshot ids (incrementing counters) to the 
incoming records. In case of a failure, a record should get the same id it had 
before the failure, so that the downstream DB can handle the records correctly. 

We were trying to do this by zipping the streaming RDDs with that counter using 
a modified version of ZippedWithIndexRDD. There are other ways to do this, but 
all of them turn out to be cumbersome and error-prone in failure scenarios.

As suggested on the Spark user/dev list, one way to do this would be to support 
monotonically_increasing_id on streaming DataFrames in the Spark code base. 
This would ensure that counters keep incrementing for the records of the 
stream. Also, since the counter can be checkpointed, it would work well in 
failure scenarios. Last but not least, doing this in Spark would be the most 
performant way.
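The scheme requested above can be illustrated without Spark: each record gets an id from a counter whose start offset is checkpointed per batch, so replaying a batch after a failure reproduces the same ids. A minimal Python sketch of that idea (hypothetical names; a dict stands in for durable checkpoint storage, and this is not the actual Spark API):

```python
class CheckpointedIdAssigner:
    """Assigns monotonically increasing ids to records, deterministically
    per batch: replaying a batch restores its checkpointed start offset,
    so every record receives the same id it got before a failure."""

    def __init__(self):
        self._checkpoints = {}  # batch_id -> start offset (stand-in for a durable checkpoint)
        self._next_id = 0

    def assign(self, batch_id, records):
        # Restore the offset if this batch was seen before (failure replay),
        # otherwise checkpoint the current counter value for it.
        start = self._checkpoints.setdefault(batch_id, self._next_id)
        ids = range(start, start + len(records))
        self._next_id = max(self._next_id, start + len(records))
        return list(zip(ids, records))

assigner = CheckpointedIdAssigner()
first = assigner.assign(0, ["a", "b"])   # [(0, 'a'), (1, 'b')]
replay = assigner.assign(0, ["a", "b"])  # same ids after a simulated failure
second = assigner.assign(1, ["c"])       # [(2, 'c')]
```

The key property is that ids are a function of the checkpointed batch offset, not of wall-clock state, which is what makes the downstream DB's job idempotent.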

 






[jira] [Reopened] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemant Bhanawat reopened SPARK-13693:
-

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}
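The failure mode above (an IOException while recursively deleting a checkpoint directory, typically because another thread is still writing under it) is commonly mitigated by retrying the delete. A short Python sketch of that mitigation, under the assumption that the contention is transient; the function name is illustrative, not Spark's:

```python
import os
import shutil
import tempfile
import time

def delete_recursively(path, attempts=3, delay=0.1):
    """Best-effort recursive delete with retries, for when another thread
    may still be touching files under the directory; re-raises if the
    final attempt also fails."""
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Demo: create a throwaway "checkpoint" directory with a file, then delete it.
checkpoint_dir = tempfile.mkdtemp(prefix="checkpoint-")
open(os.path.join(checkpoint_dir, "state"), "w").close()
delete_recursively(checkpoint_dir)
```

In a test suite the more robust fix is usually to stop whatever writes to the directory before tearing it down; the retry is a fallback, not a cure.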






[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255881#comment-15255881 ] 

Hemant Bhanawat commented on SPARK-13693:
-

Latest Jenkins builds are failing with this issue. See:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2863
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2865

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Created] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-14729:
---

 Summary: Implement an existing cluster manager with New 
ExternalClusterManager interface
 Key: SPARK-14729
 URL: https://issues.apache.org/jira/browse/SPARK-14729
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Hemant Bhanawat
Priority: Minor


SPARK-13904 adds an ExternalClusterManager interface to Spark to allow external 
cluster managers to spawn Spark components. 

This JIRA tracks the following suggestion from [~rxin]: 

'One thing - can you guys try to see if you can implement one of the existing 
cluster managers with this, and then we can make sure this is a proper API? 
Otherwise it is really easy to get removed because it is currently unused by 
anything in Spark.' 
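The interface being exercised here follows a discovery pattern: Spark asks each registered manager whether it can handle a given master URL and uses the one that claims it. A rough Python analog of that lookup, purely illustrative (the real interface is the Scala trait ExternalClusterManager, and these class names are made up):

```python
class DummyClusterManager:
    """Illustrative stand-in for an external cluster manager: it claims
    master URLs with a custom scheme."""
    def can_create(self, master_url):
        return master_url.startswith("dummy://")

class StandaloneLikeManager:
    """Stand-in for a built-in manager claiming spark:// URLs."""
    def can_create(self, master_url):
        return master_url.startswith("spark://")

REGISTERED = [DummyClusterManager(), StandaloneLikeManager()]

def find_cluster_manager(master_url):
    # Exactly one registered manager must claim the URL; ambiguity or
    # no match is an error, mirroring how such registries typically behave.
    matches = [m for m in REGISTERED if m.can_create(master_url)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one manager for {master_url}, got {len(matches)}")
    return matches[0]

manager = find_cluster_manager("dummy://host:7077")
```

Implementing an existing manager against the interface, as suggested, is exactly the kind of exercise that shocks out missing hooks in such a lookup-and-create API.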








[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247543#comment-15247543 ] 

Hemant Bhanawat commented on SPARK-14729:
-

I am looking into this. 

> Implement an existing cluster manager with New ExternalClusterManager 
> interface
> ---
>
> Key: SPARK-14729
> URL: https://issues.apache.org/jira/browse/SPARK-14729
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> SPARK-13904 adds an ExternalClusterManager interface to Spark to allow 
> external cluster managers to spawn Spark components. 
> This JIRA tracks the following suggestion from [~rxin]: 
> 'One thing - can you guys try to see if you can implement one of the existing 
> cluster managers with this, and then we can make sure this is a proper API? 
> Otherwise it is really easy to get removed because it is currently unused by 
> anything in Spark.' 






[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247096#comment-15247096 ] 

Hemant Bhanawat commented on SPARK-13904:
-

[~kiszk] Since the builds are passing now, can I assume that it was a sporadic 
issue and close this JIRA?

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers, viz. YARN, Mesos, and 
> Standalone. But as Spark is now being used in newer and different use cases, 
> there is a need to allow other cluster managers to manage Spark components. 
> One such use case is embedding Spark components like the executors and the 
> driver inside another process, which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such 
> a use case is that the executors/driver should not take the parent process 
> down when they go down, and that the components can be relaunched inside the 
> same process. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up its tasks without taking the parent 
> process down. 






[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245581#comment-15245581 ] 

Hemant Bhanawat commented on SPARK-13904:
-

I ran the following command on my machine: 
build/sbt test:compile -Pyarn -Phadoop-2.6 -Phive -Pkinesis-asl 
-Phive-thriftserver test 
but org.apache.spark.sql.hive.HiveSparkSubmitSuite passed. 

I also reviewed my code in ExternalClusterManagerSuite to ensure that I am 
stopping the SparkContext properly. Any ideas what I should be looking at? 

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers, viz. YARN, Mesos, and 
> Standalone. But as Spark is now being used in newer and different use cases, 
> there is a need to allow other cluster managers to manage Spark components. 
> One such use case is embedding Spark components like the executors and the 
> driver inside another process, which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such 
> a use case is that the executors/driver should not take the parent process 
> down when they go down, and that the components can be relaunched inside the 
> same process. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up its tasks without taking the parent 
> process down. 






[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245118#comment-15245118 ] 

Hemant Bhanawat commented on SPARK-13904:
-

[~kiszk] I am looking into this. 

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers, viz. YARN, Mesos, and 
> Standalone. But as Spark is now being used in newer and different use cases, 
> there is a need to allow other cluster managers to manage Spark components. 
> One such use case is embedding Spark components like the executors and the 
> driver inside another process, which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such 
> a use case is that the executors/driver should not take the parent process 
> down when they go down, and that the components can be relaunched inside the 
> same process. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up its tasks without taking the parent 
> process down. 






[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-31 Thread Hemant Bhanawat (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221145#comment-15221145 ] 

Hemant Bhanawat commented on SPARK-13900:
-

As I understand it, a nested loop join (producing m x n rows) is performed on 
tables A and B, and then each row is evaluated to see whether any of the 
conditions is met. You are asking that Spark instead do a BroadcastHashJoin on 
each equality condition in parallel and then union the results, as you are 
doing in your rewritten query.

If we leave parallelism aside, theoretically the time taken by the nested loop 
join would vary little as the number of conditions increases, while the time 
taken by the solution you suggest would grow linearly with the number of 
conditions. So with many conditions, the nested loop join would be faster than 
the solution you suggest. The question, then, is how Spark should decide when 
to do what. 

I think this JIRA can be closed. 
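The trade-off can be made concrete with plain data structures (not Spark): the nested-loop plan checks the OR predicate over all m x n pairs regardless of how many disjuncts there are, while the rewrite builds one hash index per disjunct and unions the matches, so its cost grows with the number of conditions. A small sketch with made-up sample rows, showing that the two strategies agree on the result:

```python
from collections import defaultdict

# Tiny sample tables; "id" identifies rows, d1/d2 are join dimensions.
A = [{"id": 1, "d1": "x", "d2": "p"}, {"id": 2, "d1": "y", "d2": "q"}]
B = [{"id": 10, "d1": "x", "d2": "q"}]

def nested_loop_or_join(left, right, cols):
    # m x n pairs, OR predicate checked per pair.
    return {(a["id"], b["id"])
            for a in left for b in right
            if any(a[c] == b[c] for c in cols)}

def union_of_hash_joins(left, right, cols):
    # One hash join per disjunct; the set union collapses duplicates
    # (the UNION ALL in the report would keep them).
    out = set()
    for c in cols:
        index = defaultdict(list)
        for b in right:
            index[b[c]].append(b)
        for a in left:
            for b in index.get(a[c], ()):
                out.add((a["id"], b["id"]))
    return out

cols = ["d1", "d2"]
result = nested_loop_or_join(A, B, cols)   # {(1, 10), (2, 10)}
```

Note the two forms differ on duplicates: a pair matching several disjuncts appears once under OR but once per disjunct under UNION ALL, which is one more thing an automatic rewrite would have to account for.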

> Spark SQL queries with OR condition is not optimized properly
> -
>
> Key: SPARK-13900
> URL: https://issues.apache.org/jira/browse/SPARK-13900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ashok kumar Rajendran
>
> I have a large table with a few billion rows and a very small table with 4 
> dimension values. All the data is stored in Parquet format. I would like to 
> get rows that match any of these dimensions. For example:
> Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR 
> A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = 
> B.dimension4.
> The query plan treats this as a BroadcastNestedLoopJoin and executes for a 
> very long time.
> If I execute this as union queries, it takes around 1.5 minutes per 
> dimension. Each query internally does a BroadcastHashJoin.
> Select field1, field2 from A, B where A.dimension1 = B.dimension1
> UNION ALL
> Select field1, field2 from A, B where A.dimension2 = B.dimension2
> UNION ALL
> Select field1, field2 from A, B where A.dimension3 = B.dimension3
> UNION ALL
> Select field1, field2 from A, B where A.dimension4 = B.dimension4.
> This is obviously not an optimal solution, as it scans the same table 
> multiple times, but it performs much better than the OR condition. 
> It seems the SQL optimizer is not working properly, which causes a huge 
> performance impact on this type of OR query.
> Please correct me if I am missing anything here. 






[jira] [Created] (SPARK-13904) Add support for pluggable cluster manager

2016-03-15 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-13904:
---

 Summary: Add support for pluggable cluster manager
 Key: SPARK-13904
 URL: https://issues.apache.org/jira/browse/SPARK-13904
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Hemant Bhanawat


Currently Spark allows only a few cluster managers, viz. YARN, Mesos, and 
Standalone. But as Spark is now being used in newer and different use cases, 
there is a need to allow other cluster managers to manage Spark components. 
One such use case is embedding Spark components like the executors and the 
driver inside another process, which may be a datastore. This allows 
colocation of data and processing. Another requirement that stems from such a 
use case is that the executors/driver should not take the parent process down 
when they go down, and that the components can be relaunched inside the same 
process. 

So, this JIRA requests two functionalities:
1. Support for external cluster managers
2. Allow a cluster manager to clean up its tasks without taking the parent 
process down. 
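The second requested functionality (components that can die and be relaunched without taking the host process down) is essentially fault isolation. A tiny Python illustration using a child process as the "embedded component"; this only demonstrates the isolation idea under that stand-in, not Spark's actual mechanism:

```python
import subprocess
import sys

def run_component(code):
    """Run an 'embedded component' in a child process so that its failure
    cannot crash the host; return the exit status so the host can decide
    whether to relaunch it."""
    return subprocess.run([sys.executable, "-c", code]).returncode

# The component crashes ...
status = run_component("import sys; sys.exit(1)")
# ... but the host process survives and simply relaunches it.
relaunched = run_component("pass")
```

The JIRA's embedding scenario is harder than this sketch: the components live in the *same* process as the datastore, so the cleanup contract has to be provided by the cluster manager rather than by OS process boundaries.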


