[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread MickDavies
GitHub user MickDavies opened a pull request:

https://github.com/apache/spark/pull/3843

[SPARK-4386] Improve performance when writing Parquet files

Convert type of RowWriteSupport.attributes to Array.

Profiling writes of very wide tables shows that most of the time is spent 
in the apply method of the attributes var. The type of attributes was 
previously LinearSeqOptimized, whose apply is O(N), which made write O(N 
squared) in the number of columns.

Measurements on a 575-column table showed that this change made writes 6x 
faster.
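
A minimal sketch of the cost difference described above, not the code from the patch. The object and method names (`ApplyCost`, `sumList`, `sumArray`) are hypothetical, and the integers merely stand in for the attributes of a wide schema:

```scala
object ApplyCost {
  // Indexed access on a List: columns(i) walks i cells, so the loop is O(N^2).
  def sumList(columns: List[Int]): Long = {
    var total = 0L
    var i = 0
    val n = columns.length
    while (i < n) {
      total += columns(i)
      i += 1
    }
    total
  }

  // Indexed access on an Array: columns(i) is constant time, so the loop is O(N).
  def sumArray(columns: Array[Int]): Long = {
    var total = 0L
    var i = 0
    val n = columns.length
    while (i < n) {
      total += columns(i)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // 575 matches the table width quoted above; the values are illustrative.
    val asList = List.range(0, 575)
    // Both calls compute the same sum; only the cost per lookup differs.
    println(sumList(asList) == sumArray(asList.toArray))
  }
}
```

The two methods are identical except for the parameter type, which is the point: the same indexed loop changes complexity depending on the collection behind it.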

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MickDavies/spark SPARK-4386

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3843.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3843


commit 892519d3bb7166ea184f0c070759b8a3b679e2c4
Author: Michael Davies michael.belldav...@gmail.com
Date:   2014-12-30T13:00:25Z

[SPARK-4386] Improve performance when writing Parquet files

Convert type of RowWriteSupport.attributes to Array.

Profiling writes of very wide tables shows that most of the time is spent 
in the apply method of the attributes var. The type of attributes was 
previously LinearSeqOptimized, whose apply is O(N), which made write O(N 
squared) in the number of columns.

Measurements on a 575-column table showed that this change made writes 6x 
faster.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68355445
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread MickDavies
Github user MickDavies commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68355607
  
@jimfcarroll sorry, I misunderstood your comment. Good that you have 
verified the performance gain.

I have added a PR. It is number 3843.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread jimfcarroll
Github user jimfcarroll commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68361866
  
@MickDavies thanks. I needed the change and was about to start profiling 
again. Creating a Parquet file from 5.5 million rows and 2000+ columns took 
over 15 hours for me, so I incorporated your change when I saw your description.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68387270
  
ok to test





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68387404
  
  [Test build #24902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24902/consoleFull) for PR 3843 at commit [`892519d`](https://github.com/apache/spark/commit/892519d3bb7166ea184f0c070759b8a3b679e2c4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68394993
  
  [Test build #24902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24902/consoleFull) for PR 3843 at commit [`892519d`](https://github.com/apache/spark/commit/892519d3bb7166ea184f0c070759b8a3b679e2c4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68394997
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24902/





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3843





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-30 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3843#issuecomment-68401615
  
Thanks! I'm merging this to master and branch-1.2.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-29 Thread jimfcarroll
Github user jimfcarroll commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68294981
  
@MickDavies, I'm not a Spark committer, so I think they're still looking 
for a PR. I just wanted to let everyone know your improvement is substantial.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68308293
  
Yeah, would love to see a PR for this :)





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-24 Thread MickDavies
Github user MickDavies commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68053039
  
@jimfcarroll - that's exactly the change I made. Performance improvements 
are very substantial for wide tables. As I said, in the case I was looking at, 
writes were 6x as fast, and the gain is more significant still if you consider 
only the processing in Spark. Thanks for checking in the improvement.






[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-23 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-67948280
  
@MickDavies Thanks for pinning down the hotspot, looking forward to your PR :)





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-23 Thread jimfcarroll
Github user jimfcarroll commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-67960092
  
Thanks @MickDavies. I was JUST about to start profiling it again. This is 
the same Scala class I originally had issues with.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-12-23 Thread jimfcarroll
Github user jimfcarroll commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-68011659
  
Just an FYI. I changed line 141 of ParquetTableSupport.scala 
(https://github.com/apache/spark/blob/ad42b283246b93654c5fd731cd618fee74d8c4da/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L141)
 from this

```scala
attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
```

to look like this:

```scala
attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray[Attribute]
```

(I also changed the type of attributes to an Array[Attribute]). As 
@MickDavies said, this seems to have a fairly dramatic effect on the 
performance.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63069883
  
ok to test





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63069991
  
Good catch, LGTM, thanks!





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63070785
  
  [Test build #23369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23369/consoleFull) for PR 3254 at commit [`30cc0b5`](https://github.com/apache/spark/commit/30cc0b592789befb7e212783846624a8a4d4381f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63081606
  
  [Test build #23369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23369/consoleFull) for PR 3254 at commit [`30cc0b5`](https://github.com/apache/spark/commit/30cc0b592789befb7e212783846624a8a4d4381f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63081617
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23369/





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-63144485
  
This is great, thanks for looking into this!  We haven't done much 
profiling of some of these critical code sections yet.  I wonder if there 
aren't other places where we are being sub-optimal.

In general, I wonder if it isn't a good idea to make sure that in the 
critical parts we convert to raw `Array`s that have constant-time `length` 
functions and lookups (and also avoid function call overhead for both, if I 
understand correctly).
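
As an illustration of the suggestion above, here is a hedged sketch of converting to a raw `Array` once and then indexing inside the hot loop. `HotLoop`, `Field`, and `nameLengths` are hypothetical names invented for this example and do not come from the Spark code base:

```scala
object HotLoop {
  // Hypothetical record type standing in for a schema attribute.
  final case class Field(name: String)

  // Convert once, outside the critical section, then index into the array.
  def nameLengths(fields: Seq[Field]): Array[Int] = {
    val arr: Array[Field] = fields.toArray // one O(N) copy up front
    val out = new Array[Int](arr.length)   // length is a constant-time read
    var i = 0
    while (i < arr.length) {
      out(i) = arr(i).name.length          // constant-time lookup
      i += 1
    }
    out
  }
}
```

The single copy costs O(N), which pays for itself as soon as the sequence is indexed more than a handful of times per row.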

I've merged to master and 1.2 to make sure this optimization at least makes 
the next release.





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3254





[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-13 Thread jimfcarroll
GitHub user jimfcarroll opened a pull request:

https://github.com/apache/spark/pull/3254

[SPARK-4386] Improve performance when writing Parquet files.

If you profile the writing of a Parquet file, the single most 
time-consuming call inside of 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the 
scala.collection.AbstractSequence.size call. This is because the size call 
actually ends up COUNTING the elements in a 
scala.collection.LinearSeqOptimized.length (optimized?).

This doesn't need to be done. Currently size is called repeatedly wherever 
it is needed, rather than once at the top of the method with the result 
stored in a 'val'.
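
The idea can be sketched as follows. This is an illustrative reduction, not the actual write method: `SizeHoisting`, `countNonNullBefore`, and `countNonNullAfter` are hypothetical names, and the list of strings merely stands in for the attribute list:

```scala
object SizeHoisting {
  // Before: attributes.size is re-evaluated on every pass, and on a List
  // each evaluation walks all N cells.
  def countNonNullBefore(attributes: List[String], row: Array[Any]): Int = {
    var count = 0
    var index = 0
    while (index < attributes.size) {
      if (row(index) != null) count += 1
      index += 1
    }
    count
  }

  // After: size is computed once at the top and kept in a val.
  def countNonNullAfter(attributes: List[String], row: Array[Any]): Int = {
    val size = attributes.size
    var count = 0
    var index = 0
    while (index < size) {
      if (row(index) != null) count += 1
      index += 1
    }
    count
  }
}
```

Both versions return the same count; the second simply avoids traversing the list once per loop iteration.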


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jimfcarroll/spark parquet-perf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3254.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3254


commit 30cc0b592789befb7e212783846624a8a4d4381f
Author: Jim Carroll j...@dontcallme.com
Date:   2014-11-13T20:40:52Z

Improve performance when writing Parquet files.







[GitHub] spark pull request: [SPARK-4386] Improve performance when writing ...

2014-11-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3254#issuecomment-62963142
  
Can one of the admins verify this patch?

