[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 12:06 PM:

Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.1 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>            Priority: Minor
>
> update
> ***
> It seems that in the Spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows into a list in a hash map.
> ***
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
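For reference, a minimal sketch of the grouping approach proposed above, under simplifying assumptions: Row, partitionKey, and writeRow below are hypothetical stand-ins for the writer's real types (UnsafeRow, the partition-key projection, and the per-partition output writer), not Spark internals.

{code:scala}
import scala.collection.mutable

// Hypothetical row type standing in for UnsafeRow.
case class Row(partitionKey: String, payload: String)

// Buffer rows per partition key instead of feeding them to a sorter.
// Appending to a per-key buffer preserves the incoming (already sorted) row order.
def writePartitioned(rows: Iterator[Row])(writeRow: (String, Row) => Unit): Unit = {
  val buffers = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
  rows.foreach { row =>
    buffers.getOrElseUpdate(row.partitionKey, mutable.ArrayBuffer.empty[Row]) += row
  }
  // Walk the keys and write each buffered list in its original order.
  for ((key, buffered) <- buffers; row <- buffered) writeRow(key, row)
}
{code}

A LinkedHashMap is used here only so that partition directories are written in first-seen order; a plain HashMap, as proposed, would work just as well, since only the per-key row order matters.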
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:30 AM:

Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure - Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ]

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:29 AM:

Sure - Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.
It seems that in the Spark 2.0 code the sorting issue is resolved: the sorter does consider the inner sorting in its sorting key. Still, I think it would be faster to just insert the rows into per-key lists in a hash map.
In any case, I suggest changing this issue to minor.

was (Author: ran.h...@optimalplus.com):
Sure. Basically, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ]

Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM:

Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you still have to tell the writer how to partition the data. This means that when you want to write your data into a partitioned folder you lose the sorting, and that is unacceptable considering read performance and on-disk data size.
I have already changed the code locally and it works as expected - but I have no permission to create a PR, and I do not know how to get it.

was (Author: ran.h...@optimalplus.com):
Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you still have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose the sorting, and that is unacceptable considering read performance and on-disk data size.
I have already changed the code locally and it works as expected - but I have no permission to create a PR, and I do not know how to get it.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
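As background, the usage pattern being described - sort the data within its partitions and then write it with partitionBy - might look like the following sketch. The column names ("customer", "ts") and paths are hypothetical, and SparkSession assumes Spark 2.x.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-partitioned-write").getOrCreate()

// Hypothetical input: events with a partition column "customer" and a sort column "ts".
val df = spark.read.parquet("/data/events")

// Cluster rows by the partition column, then sort inside each task partition.
// The expectation is that the writer keeps this per-file order; this issue reports
// that the dynamic-partition spill path (UnsafeKVExternalSorter) may reorder rows.
df.repartition(df("customer"))
  .sortWithinPartitions("customer", "ts")
  .write
  .partitionBy("customer")
  .orc("/data/events_by_customer")
{code}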
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ]

Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM:

I have basically cloned the repository from https://github.com/apache/spark and ran:
"build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install"
This always fails for me... can you point me to someone who can help me?

was (Author: ran.h...@optimalplus.com):
I have basically cloned the repository from https://github.com/apache/spark and ran:
"build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test"
This always fails for me... can you point me to someone who can help me?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ]

Ran Haim edited comment on SPARK-17436 at 11/14/16 3:45 PM:

So... can you help? Or at least point me to someone who can?

BTW, on spark-core I get test errors like these:

org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache
  Run 2: JavaAPISuite.tearDown:92 NullPointer

I think it has something to do with this error that I also see:
Error while locating file spark-version-info.properties

was (Author: ran.h...@optimalplus.com):
So... can you help? Or at least point me to someone who can?

BTW, on spark-core I get test errors like these:

org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache
  Run 2: JavaAPISuite.tearDown:92 NullPointer

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ]

Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM:

Hi,
I only got a chance to work on it now. I saw that the whole class tree has changed, so I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run a mvn clean install... a lot of tests fail (not related to my change; they fail without it as well) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas?
Also, I cannot create a pull request - I get a 403.
Ran

was (Author: ran.h...@optimalplus.com):
Hi,
I only got a chance to work on it now. I saw that the whole class tree has changed, so I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run a mvn clean install... a lot of tests fail (not related to my change; they fail without it as well) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas?
Ran

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:35 AM:

Usually you partition the data and then order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly.
I should have some time to work on it next week or so - can I just open a pull request and put it here?

was (Author: ran.h...@optimalplus.com):
usually you partition the data and then order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly.
I should have some time to work on it next week or so - can I just open a pull request and put it here?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
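To support the claim above that the DataFrame is already ordered correctly before the write (so any disorder in the output files would have been introduced by the writer), one could spot-check the within-partition order in memory. This is a rough, illustrative sketch under an assumed schema (customer: String, ts: Long) and hypothetical paths, not code from this issue.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input, as in the earlier sketch: columns "customer" (String) and "ts" (Long).
val df = spark.read.parquet("/data/events")
val sorted = df.repartition(df("customer")).sortWithinPartitions("customer", "ts")

// Count adjacent rows inside each Spark partition that violate the (customer, ts) order.
// If this prints 0, the in-memory DataFrame is ordered as expected before writing.
val violations = sorted.rdd.mapPartitions { rows =>
  val pairs = rows.map(r => (r.getAs[String]("customer"), r.getAs[Long]("ts")))
  val count = pairs.sliding(2).count {
    case Seq((k1, t1), (k2, t2)) => k1 == k2 && t2 < t1  // same key, timestamp went backwards
    case _                       => false                // partitions with fewer than two rows
  }
  Iterator.single(count)
}.sum()

println(s"within-partition order violations before writing: $violations")
{code}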
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:32 AM:

Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solution for queries.
The fix is pretty small; I can work on it myself - how can I do that?

was (Author: ran.h...@optimalplus.com):
Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:22 AM:

Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide a good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

was (Author: ran.h...@optimalplus.com):
Of course it does - every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide good solutions for queries.
The fix is pretty small; I can work on it myself - how can I do that?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up an ordered dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it starts inserting rows into an UnsafeKVExternalSorter; it then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and just put the rows in a map, keyed by the partition key with an ArrayList as the value, and then walk through all the keys and write each list in its original order - this will probably be faster, as there is no need for ordering.