[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 12:06 PM:

Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.1 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>            Priority: Minor
>
> update
> ***
> It seems that in the Spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows into a list in a hash map.
> ***
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
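For reference, a minimal sketch of the grouping approach proposed above, under simplifying assumptions: Row, partitionKey, and writeRow below are hypothetical stand-ins for the writer's real types (UnsafeRow, the partition-key projection, and the per-partition output writer), not Spark internals.

{code:scala}
import scala.collection.mutable

// Hypothetical row type standing in for UnsafeRow.
case class Row(partitionKey: String, payload: String)

// Buffer rows per partition key instead of feeding them to a sorter.
// Appending to a per-key buffer preserves the incoming (already sorted) row order.
def writePartitioned(rows: Iterator[Row])(writeRow: (String, Row) => Unit): Unit = {
  val buffers = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
  rows.foreach { row =>
    buffers.getOrElseUpdate(row.partitionKey, mutable.ArrayBuffer.empty[Row]) += row
  }
  // Walk the keys and write each buffered list in its original order.
  for ((key, buffered) <- buffers; row <- buffered) writeRow(key, row)
}
{code}

A LinkedHashMap is used here only so that partition directories are written in first-seen order; a plain HashMap, as proposed, would work just as well, since only the per-key row order matters.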
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:30 AM:

Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure - Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:29 AM:

Sure - Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure. Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ]

Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM:

Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you still have to tell the writer how to partition the data. This means that when you want to write your data into a partitioned folder you lose the sorting, and that is unacceptable considering read performance and on-disk data size.
I have already changed the code locally and it works as expected - but I have no permission to create a PR, and I do not know how to get it.

was (Author: ran.h...@optimalplus.com):
Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you still have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose the sorting, and that is unacceptable considering read performance and on-disk data size.
I have already changed the code locally and it works as expected - but I have no permission to create a PR, and I do not know how to get it.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
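As background, the usage pattern being described - sort the data within its partitions and then write it with partitionBy - might look like the following sketch. The column names ("customer", "ts") and paths are hypothetical, and SparkSession assumes Spark 2.x.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-partitioned-write").getOrCreate()

// Hypothetical input: events with a partition column "customer" and a sort column "ts".
val df = spark.read.parquet("/data/events")

// Cluster rows by the partition column, then sort inside each task partition.
// The expectation is that the writer keeps this per-file order; this issue reports
// that the dynamic-partition spill path (UnsafeKVExternalSorter) may reorder rows.
df.repartition(df("customer"))
  .sortWithinPartitions("customer", "ts")
  .write
  .partitionBy("customer")
  .orc("/data/events_by_customer")
{code}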
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ]

Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM:

I have basically cloned the repository from https://github.com/apache/spark and ran:
"build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install"
This always fails for me... can you point me to someone who can help me?

was (Author: ran.h...@optimalplus.com):
I have basically cloned the repository from https://github.com/apache/spark and ran:
"build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test"
This always fails for me... can you point me to someone who can help me?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ]

Ran Haim edited comment on SPARK-17436 at 11/14/16 3:45 PM:

So... can you help? Or at least point me to someone who can?

BTW, on spark-core I get test errors like these:

org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache
  Run 2: JavaAPISuite.tearDown:92 NullPointer

I think it has something to do with this error that I also see:
Error while locating file spark-version-info.properties

was (Author: ran.h...@optimalplus.com):
So... can you help? Or at least point me to someone who can?

BTW, on spark-core I get test errors like these:

org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache
  Run 2: JavaAPISuite.tearDown:92 NullPointer

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ]

Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM:

Hi,
I only got a chance to work on it now. I saw that the whole class tree has changed, so I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run a mvn clean install... a lot of tests fail (not related to my change; they fail without it as well) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas?
Also, I cannot create a pull request - I get a 403.
Ran

was (Author: ran.h...@optimalplus.com):
Hi,
I only got a chance to work on it now. I saw that the whole class tree has changed, so I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run a mvn clean install... a lot of tests fail (not related to my change; they fail without it as well) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas?
Ran

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:35 AM:

Usually you partition the data and then order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly.
I should have some time to work on it next week or so - can I just open a pull request and put it here?

was (Author: ran.h...@optimalplus.com):
usually you partition the data and then order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly.
I should have some time to work on it next week or so - can I just open a pull request and put it here?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
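To support the claim above that the DataFrame is already ordered correctly before the write (so any disorder in the output files would have been introduced by the writer), one could spot-check the within-partition order in memory. This is a rough, illustrative sketch under an assumed schema (customer: String, ts: Long) and hypothetical paths, not code from this issue.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input, as in the earlier sketch: columns "customer" (String) and "ts" (Long).
val df = spark.read.parquet("/data/events")
val sorted = df.repartition(df("customer")).sortWithinPartitions("customer", "ts")

// Count adjacent rows inside each Spark partition that violate the (customer, ts) order.
// If this prints 0, the in-memory DataFrame is ordered as expected before writing.
val violations = sorted.rdd.mapPartitions { rows =>
  val pairs = rows.map(r => (r.getAs[String]("customer"), r.getAs[Long]("ts")))
  val count = pairs.sliding(2).count {
    case Seq((k1, t1), (k2, t2)) => k1 == k2 && t2 < t1  // same key, timestamp went backwards
    case _                       => false                // partitions with fewer than two rows
  }
  Iterator.single(count)
}.sum()

println(s"within-partition order violations before writing: $violations")
{code}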
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:32 AM:

Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solution for queries.
The fix is pretty small; I can work on it myself - how can I do that?

was (Author: ran.h...@optimalplus.com):
Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:22 AM:

Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

was (Author: ran.h...@optimalplus.com):
Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.