[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
GitHub user MickDavies opened a pull request: https://github.com/apache/spark/pull/3843 [SPARK-4386] Improve performance when writing Parquet files Convert type of RowWriteSupport.attributes to Array. Analysis of performance for writing very wide tables shows that time is spent predominantly in apply method on attributes var. Type of attributes previously was LinearSeqOptimized and apply is O(N) which made write O(N squared). Measurements on 575 column table showed this change made a 6x improvement in write times. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MickDavies/spark SPARK-4386 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3843.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3843 commit 892519d3bb7166ea184f0c070759b8a3b679e2c4 Author: Michael Davies michael.belldav...@gmail.com Date: 2014-12-30T13:00:25Z [SPARK-4386] Improve performance when writing Parquet files Convert type of RowWriteSupport.attributes to Array. Analysis of performance for writing very wide tables shows that time is spent predominantly in apply method on attributes var. Type of attributes previously was LinearSeqOptimized and apply is O(N) which made write O(N squared). Measurements on 575 column table showed this change made a 6x improvement in write times. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
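To illustrate the complexity argument in the description above, here is a minimal sketch, not Spark's actual `RowWriteSupport` code. It assumes a per-row loop that reads every column by index; `WideRowSketch`, `sumList`, and `sumArray` are hypothetical names. On a `List`, `columns(i)` walks `i` nodes, so one row of N columns costs about N(N-1)/2 steps (165,025 for the 575 columns mentioned), while on an `Array` each lookup is a direct index.

```scala
object WideRowSketch {
  // Hypothetical per-row loop over a linked list.
  // Each columns(i) walks i nodes, so one pass is O(N^2) overall.
  def sumList(columns: List[Int]): Long = {
    val n = columns.length // itself an O(N) walk on a List
    var total = 0L
    var i = 0
    while (i < n) {
      total += columns(i)
      i += 1
    }
    total
  }

  // Same loop over an Array.
  // Each columns(i) is a constant-time index, so one pass is O(N).
  def sumArray(columns: Array[Int]): Long = {
    val n = columns.length // constant time on an Array
    var total = 0L
    var i = 0
    while (i < n) {
      total += columns(i)
      i += 1
    }
    total
  }
}
```

Both functions return the same result for the same data; only the cost of each indexed lookup differs, which is why converting once with `toArray` can pay off on very wide rows.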
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68355445 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user MickDavies commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68355607 @jimfcarroll sorry, I misunderstood your comment. Good that you have verified the performance gain. I have added a PR. It is number 3843.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user jimfcarroll commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68361866 @MickDavies thanks. I needed the change and was beginning the process of profiling again. Creating a Parquet file from 5.5 million rows with 2000+ columns took over 15 hours for me, so I incorporated your change when I saw your description.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68387270 ok to test
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68387404 [Test build #24902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24902/consoleFull) for PR 3843 at commit [`892519d`](https://github.com/apache/spark/commit/892519d3bb7166ea184f0c070759b8a3b679e2c4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68394993 [Test build #24902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24902/consoleFull) for PR 3843 at commit [`892519d`](https://github.com/apache/spark/commit/892519d3bb7166ea184f0c070759b8a3b679e2c4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68394997 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24902/ Test PASSed.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3843
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3843#issuecomment-68401615 Thanks! I'm merging this to master and branch-1.2.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user jimfcarroll commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68294981 @MickDavies, I'm not a Spark committer so I think they're still looking for a PR. I just wanted to let everyone know your improvement is substantial.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68308293 Yeah, would love to see a PR for this :)
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user MickDavies commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68053039 @jimfcarroll - that's exactly the change I made. Performance improvements are very substantial for wide tables: as I said, in the case I was looking at, writes were 6x as fast, and the gain is more significant still if you consider just the processing in Spark. Thanks for checking in the improvement.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-67948280 @MickDavies Thanks for pinning down the hotspot, looking forward to your PR :)
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user jimfcarroll commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-67960092 Thanks @MickDavies. I was JUST about to start profiling it again. This is the same Scala class I originally had issues with.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user jimfcarroll commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-68011659 Just an FYI. I changed line 141 of ParquetTableSupport.scala (https://github.com/apache/spark/blob/ad42b283246b93654c5fd731cd618fee74d8c4da/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L141) from this:

```scala
attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
```

to look like this:

```scala
attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray[Attribute]
```

(I also changed the type of attributes to an Array[Attribute]). As @MickDavies said, this seems to have a fairly dramatic effect on the performance.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63069883 ok to test
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63069991 Good catch, LGTM, thanks!
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63070785 [Test build #23369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23369/consoleFull) for PR 3254 at commit [`30cc0b5`](https://github.com/apache/spark/commit/30cc0b592789befb7e212783846624a8a4d4381f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63081606 [Test build #23369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23369/consoleFull) for PR 3254 at commit [`30cc0b5`](https://github.com/apache/spark/commit/30cc0b592789befb7e212783846624a8a4d4381f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63081617 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23369/ Test PASSed.
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-63144485 This is great, thanks for looking into this! We haven't done much profiling of some of these critical code sections yet. I wonder if there aren't other places where we are being sub-optimal. In general, I wonder if it isn't a good idea to make sure that in the critical parts we convert to raw `Array`s that have constant time `length` functions and lookups (and also avoid function call overhead for both if I understand correctly). I've merged to master and 1.2 to make sure this optimization at least makes the next release.
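One way to read the suggestion above is to convert a `Seq` to a raw `Array` once, outside the hot path, and use only the array inside the per-row code. The following is a minimal sketch under that assumption; `ColumnNameWriter` and `render` are hypothetical names, not part of Spark, and the row is assumed to have at least as many values as there are columns.

```scala
// Hypothetical writer, for illustration only; not Spark's RowWriteSupport.
final class ColumnNameWriter(schema: Seq[String]) {
  // Convert once at construction time, outside the per-row hot path.
  private val columns: Array[String] = schema.toArray
  // Array length is a constant-time field read; cache it anyway for clarity.
  private val width: Int = columns.length

  // Hot path: every lookup below is a constant-time array index.
  // Assumes row.length >= width.
  def render(row: Array[Any]): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < width) {
      if (i > 0) sb.append(',')
      sb.append(columns(i)).append('=').append(row(i))
      i += 1
    }
    sb.toString
  }
}
```

For example, `new ColumnNameWriter(List("a", "b")).render(Array[Any](1, "x"))` yields `a=1,b=x`. The caller may pass any `Seq`, including a `List`; the linear-time structure is never indexed after construction.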
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3254
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
GitHub user jimfcarroll opened a pull request: https://github.com/apache/spark/pull/3254 [SPARK-4386] Improve performance when writing Parquet files. If you profile the writing of a Parquet file, the single most time-consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSeq.size call. This is because the size call actually ends up COUNTING the elements in scala.collection.LinearSeqOptimized.length (optimized?). This doesn't need to be done. Currently, size is called repeatedly where needed rather than called once at the top of the method and stored in a 'val'. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jimfcarroll/spark parquet-perf Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3254.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3254 commit 30cc0b592789befb7e212783846624a8a4d4381f Author: Jim Carroll j...@dontcallme.com Date: 2014-11-13T20:40:52Z Improve performance when writing Parquet files.
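The pattern described above can be sketched as follows. This is an illustration under assumptions, not the code from the patch: `HoistSizeSketch` and both method names are hypothetical. On a `List`, `size` counts the elements on every call, so evaluating it in the loop condition repeats that walk on every iteration, whereas storing it in a `val` performs the walk once.

```scala
object HoistSizeSketch {
  // attrs.size is re-evaluated on every iteration of the loop condition.
  // On a List each call counts all elements, so the loop does O(N^2) work.
  def visitRepeatedSize(attrs: List[String]): Int = {
    var visited = 0
    var i = 0
    while (i < attrs.size) {
      visited += 1
      i += 1
    }
    visited
  }

  // size is computed once and stored in a val, so the loop itself is O(N).
  def visitHoistedSize(attrs: List[String]): Int = {
    val n = attrs.size
    var visited = 0
    var i = 0
    while (i < n) {
      visited += 1
      i += 1
    }
    visited
  }
}
```

Both methods visit the same number of elements; the difference is only how often the list is counted.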
[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3254#issuecomment-62963142 Can one of the admins verify this patch?