[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567221#comment-14567221 ]

Sean Owen commented on SPARK-8008:
----------------------------------

Isn't this what connection pooling is for? Is that an option?

sqlContext.jdbc can kill your database due to high concurrency
--------------------------------------------------------------

                Key: SPARK-8008
                URL: https://issues.apache.org/jira/browse/SPARK-8008
            Project: Spark
         Issue Type: Bug
           Reporter: Rene Treffer

Spark tries to load as many partitions as possible in parallel, which can in turn overload the database, although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or at least warn about this behavior.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567230#comment-14567230 ]

Rene Treffer commented on SPARK-8008:
-------------------------------------

At the moment each partition uses its own connection, as far as I can tell. I have to double-check how this works on a cluster, where multiple servers might fetch data. I'm currently loading year+month wise, due to the DB schema (index on actual days, locality based on year/month). I don't think larger batches would be a solution: three months may comprise 160 million rows, and batching that into one partition is not a good idea.
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567219#comment-14567219 ]

Michael Armbrust commented on SPARK-8008:
-----------------------------------------

I'm okay adding documentation about this behavior wherever you think it would help, but I would say this is by design. I'd suggest that if you want lower concurrency, use fewer partitions to extract the data and then {{repartition}} if you need higher concurrency for subsequent operations.
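The suggestion above can be sketched as follows. This assumes the Spark 1.4-era `sqlContext.read.jdbc` API; the URL, table name, column, and bounds are hypothetical placeholders, not values from this issue:

```scala
import java.util.Properties

// Extract with few partitions so only a handful of JDBC connections hit
// the database at once (numPartitions == number of concurrent connections).
// All identifiers below (URL, table, column, bounds) are made-up examples.
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

val df = sqlContext.read.jdbc(
  "jdbc:mysql://dbhost/mydb",   // url
  "events",                     // table
  "id",                         // integral partitioning column
  0L,                           // lowerBound
  10000000L,                    // upperBound
  4,                            // numPartitions: at most 4 concurrent connections
  props)

// Then repartition for higher parallelism in the Spark stages that follow.
val wide = df.repartition(64)
```

Because the JDBC scan and the repartition are separate stages, the database only ever sees the narrow numPartitions fan-out, while downstream operators run at the wider parallelism.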
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567227#comment-14567227 ]

Michael Armbrust commented on SPARK-8008:
-----------------------------------------

I think connection pooling is used primarily to avoid the overhead of making a new connection for each operation. In the case of extracting large amounts of data, the user may actually want multiple concurrent connections from the same machine.
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567234#comment-14567234 ]

Michael Armbrust commented on SPARK-8008:
-----------------------------------------

What is the problem with large partitions (as long as you aren't caching them, where there is a 2GB limit)?
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567240#comment-14567240 ]

Rene Treffer commented on SPARK-8008:
-------------------------------------

I've seen very poor performance when streaming the data as a single partition, for example with {{WHERE 1=1}}. I'll retry with different partition counts. But I still think there should be a warning about this behavior, as I didn't naturally understand that partition count == parallelism in this case (although it's logical after some thinking).
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567757#comment-14567757 ]

Sean Owen commented on SPARK-8008:
----------------------------------

I suppose I meant you can block waiting on a new connection after the max is hit, instead of opening far too many.
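One way to realize this idea, sketched below, is a semaphore that bounds how many connections a single JVM will open at once. `ThrottledConnectionSource` and its parameters are hypothetical names for illustration, not an existing Spark or JDBC API:

```scala
import java.util.concurrent.Semaphore

// Cap the number of simultaneously open connections; callers block on
// acquire() once the cap is reached, instead of opening far too many.
// This is a sketch, not part of Spark: `open` stands in for something
// like DriverManager.getConnection.
class ThrottledConnectionSource[C](maxOpen: Int,
                                   open: () => C,
                                   close: C => Unit) {
  private val permits = new Semaphore(maxOpen)

  def withConnection[T](work: C => T): T = {
    permits.acquire()               // blocks when maxOpen connections are in use
    try {
      val conn = open()
      try work(conn)
      finally close(conn)           // always return the connection
    } finally permits.release()     // free the slot for the next waiter
  }
}
```

Each partition's reader would wrap its fetch in `withConnection`, so excess partitions simply wait for a free slot rather than piling onto the database.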
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567993#comment-14567993 ]

Reynold Xin commented on SPARK-8008:
------------------------------------

As discussed on the dev list, there's already a warning to avoid high concurrency:

{code}
/**
 * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
 * url named table. Partitions of the table will be retrieved in parallel based on the parameters
 * passed to this function.
 *
 * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
 * your external database systems.
 *
 * @param url JDBC database url of the form `jdbc:subprotocol:subname`
 * @param table Name of the table in the external database.
 * @param columnName the name of a column of integral type that will be used for partitioning.
 * @param lowerBound the minimum value of `columnName` used to decide partition stride
 * @param upperBound the maximum value of `columnName` used to decide partition stride
 * @param numPartitions the number of partitions. the range `minValue`-`maxValue` will be split
 *                      evenly into this many partitions
 * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
 *                             tag/value. Normally at least a user and password property
 *                             should be included.
 *
 * @since 1.4.0
 */
{code}

Even with the warning, it'd be great to have some way to throttle.
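For reference, the stride behavior the scaladoc describes can be sketched like this. It is a simplified model of how the bounds become per-partition WHERE clauses (loosely modeled on Spark's internal `JDBCRelation.columnPartition`, not the exact implementation):

```scala
// Split [lower, upper) into numPartitions WHERE clauses; each clause
// becomes one partition and therefore one concurrent JDBC connection.
// Simplified sketch -- not Spark's exact internal code.
def columnPartitions(column: String, lower: Long, upper: Long,
                     numPartitions: Int): Seq[String] =
  if (numPartitions <= 1) {
    Seq("1=1")  // a single partition scans the whole table
  } else {
    val stride = (upper - lower) / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lower + i * stride
      val hi = lower + (i + 1) * stride
      if (i == 0) s"$column < $hi"                        // catch everything below
      else if (i == numPartitions - 1) s"$column >= $lo"  // catch everything above
      else s"$column >= $lo AND $column < $hi"
    }
  }
```

This makes the cost visible: numPartitions clauses means numPartitions simultaneous queries against the source database, which is exactly the behavior the warning cautions about.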
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567999#comment-14567999 ]

Reynold Xin commented on SPARK-8008:
------------------------------------

Batching everything into giant partitions has significant downsides w.r.t. fault tolerance, memory consumption, etc. I wouldn't call this a super high priority, but if somebody works on a patch we should merge it.