[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567221#comment-14567221 ]

Sean Owen commented on SPARK-8008:
--

Isn't this what connection pooling is for? Is that an option?

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can
 overload the database even though all partitions could be loaded successfully
 at a lower concurrency.
 It would be nice to either limit the maximum concurrency or at least warn
 about this behavior.






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Rene Treffer (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567230#comment-14567230 ]

Rene Treffer commented on SPARK-8008:
-

At the moment each partition uses its own connection, as far as I can tell. I
have to double-check how this works on a cluster, where multiple servers might
even be fetching data.

I'm currently loading year+month-wise due to the DB schema (index on actual
days, locality based on year/month).

I don't think larger batches would be a solution. Three months may require 160
million rows, and I don't think batching that into one partition is a good idea.
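For illustration, a minimal sketch of that month-wise loading using the
predicate-based JDBC reader (Spark 1.4 API; the URL, table, and column names
here are hypothetical). Each predicate becomes one partition and therefore one
JDBC connection, so the length of the array caps the concurrency:

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "reader")        // assumed credentials
props.setProperty("password", "secret")

// One predicate per partition: three partitions, at most three connections.
val predicates = Array(
  "year = 2015 AND month = 1",
  "year = 2015 AND month = 2",
  "year = 2015 AND month = 3")

val df = sqlContext.read.jdbc(
  "jdbc:mysql://db-host/mydb",             // hypothetical URL
  "measurements",                          // hypothetical table
  predicates,
  props)
{code}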




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567219#comment-14567219 ]

Michael Armbrust commented on SPARK-8008:
-

I'm okay adding documentation about this behavior wherever you think it would
help, but I would say this is by design.

I'd suggest that if you want lower concurrency, use fewer partitions to extract
the data, and then {{repartition}} if you need higher concurrency for
subsequent operations.
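For illustration, a sketch of that pattern (the URL, table, and bounds are
hypothetical): extract with few partitions to keep the number of concurrent
database connections low, then {{repartition}} for downstream parallelism.

{code}
// Read with only 4 partitions => at most 4 concurrent JDBC connections.
val df = sqlContext.jdbc(
  "jdbc:mysql://db-host/mydb",   // hypothetical URL
  "events",                      // hypothetical table
  "id",                          // integral partitioning column
  0L, 10000000L,                 // assumed bounds of `id`
  4)                             // numPartitions: low read concurrency

// Fan out again for CPU-bound downstream stages.
val wide = df.repartition(200)
{code}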




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567227#comment-14567227 ]

Michael Armbrust commented on SPARK-8008:
-

I think connection pooling is used primarily to avoid the overhead of making a 
new connection for each operation.  In the case of extracting large amounts of 
data, the user may actually want multiple concurrent connections from the same 
machine.




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567234#comment-14567234 ]

Michael Armbrust commented on SPARK-8008:
-

What is the problem with large partitions (as long as you aren't caching them,
in which case there is a 2GB limit)?




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Rene Treffer (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567240#comment-14567240 ]

Rene Treffer commented on SPARK-8008:
-

I've seen very poor performance when streaming the table as a single partition,
for example with (WHERE 1=1). I'll retry with different partition counts.

But I still think there should be a warning about this behavior, as I didn't
immediately understand that partition count == parallelism in this case
(although it's logical after some thinking).




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567757#comment-14567757 ]

Sean Owen commented on SPARK-8008:
--

I suppose I meant you can block waiting on a new connection after the max is 
hit instead of opening far too many.
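For illustration, a sketch of that blocking behavior using Apache Commons DBCP2
(this is not what Spark's JDBC source does today; the URL and credentials are
hypothetical). A pool with a hard cap makes the ninth caller wait for a free
connection instead of opening a new one:

{code}
import org.apache.commons.dbcp2.BasicDataSource

val pool = new BasicDataSource()
pool.setUrl("jdbc:mysql://db-host/mydb")   // hypothetical URL
pool.setUsername("reader")
pool.setPassword("secret")
pool.setMaxTotal(8)                        // hard cap on open connections
pool.setMaxWaitMillis(60000)               // block up to 60s for a free one

val conn = pool.getConnection()            // blocks once 8 are checked out
try {
  // ... run this partition's query ...
} finally {
  conn.close()                             // returns the connection to the pool
}
{code}

Note that such a pool is per-JVM, so on a cluster it would only cap connections
per executor, not across the whole job.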




[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567993#comment-14567993 ]

Reynold Xin commented on SPARK-8008:


As discussed on the dev list, there's already a warning to avoid high concurrency:

{code}
  /**
   * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
   * url named table. Partitions of the table will be retrieved in parallel based on the
   * parameters passed to this function.
   *
   * Don't create too many partitions in parallel on a large cluster; otherwise Spark
   * might crash your external database systems.
   *
   * @param url JDBC database url of the form `jdbc:subprotocol:subname`
   * @param table Name of the table in the external database.
   * @param columnName the name of a column of integral type that will be used for
   *                   partitioning.
   * @param lowerBound the minimum value of `columnName` used to decide partition stride
   * @param upperBound the maximum value of `columnName` used to decide partition stride
   * @param numPartitions the number of partitions. The range `minValue`-`maxValue` will
   *                      be split evenly into this many partitions.
   * @param connectionProperties JDBC database connection arguments, a list of arbitrary
   *                             string tag/value. Normally at least a user and password
   *                             property should be included.
   *
   * @since 1.4.0
   */
{code}

Even with the warning, it'd be great to have some way to throttle.
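One user-side way to throttle, as a hedged sketch (not an existing Spark API;
the wave keys, URL, and output paths are hypothetical): extract the table in
sequential waves, each with only a few partitions, and persist each wave before
starting the next, so the database never sees more than a handful of
connections at once.

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "loader")        // assumed credentials
props.setProperty("password", "secret")

val months = Seq("2015-01", "2015-02", "2015-03")   // hypothetical wave keys
months.foreach { m =>
  val wave = sqlContext.read.jdbc(
    "jdbc:mysql://db-host/mydb", "events",
    Array(s"DATE_FORMAT(day, '%Y-%m') = '$m'"),     // one predicate => one connection
    props)
  wave.write.parquet(s"/data/events/month=$m")      // finish before the next wave
}
{code}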





[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567999#comment-14567999 ]

Reynold Xin commented on SPARK-8008:


Batching everything into giant partitions has pretty big downsides w.r.t. fault
tolerance, memory consumption, etc. I wouldn't call this a super high priority,
but if somebody works on a patch we should merge it.


