[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-07-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161566#comment-17161566
 ] 

Xinli Shang commented on PARQUET-1739:
--

[~yumwang], can you share whether the implementation to skip Parquet pages has 
been done in Spark? [~gszadovszky] asked that question in SPARK-26346. If you 
haven't implemented it yet, I will start looking into it. 

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-07-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151859#comment-17151859
 ] 

Gabor Szadovszky commented on PARQUET-1739:
---

Removed 1.11.1 as the target release because we are releasing 1.11.1 to fix an 
independent regression. We will initiate another patch release for 1.11 if it 
is required for the Spark integration.



[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-04-15 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084075#comment-17084075
 ] 

Gabor Szadovszky commented on PARQUET-1739:
---

[~yumwang],

Have you succeeded in implementing the page-skipping mechanism in Spark? Without 
it you may only see the overhead of the column indexes and not the benefit.
Meanwhile, even if page skipping is implemented, there might be a slight 
performance degradation when the data is not sorted at all (the min/max 
values are very similar across the pages). In that case the column/offset 
index reading I/O is pure overhead: we cannot drop any pages based on the 
min/max values, so we read the same amount of data as we would without 
column indexes.
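
To illustrate the point about sorted versus unsorted data, here is a minimal 
Python sketch of min/max-based page skipping. It is not the parquet-mr or Spark 
implementation: the `Page` type, `build_pages`, `pages_to_read`, the page size 
of 100 values, and the equality predicate are all hypothetical choices made for 
the illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for one data page's column-index entry."""
    min_value: int
    max_value: int


def build_pages(values: List[int], page_size: int) -> List[Page]:
    """Split values into fixed-size pages and record each page's min/max."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append(Page(min(chunk), max(chunk)))
    return pages


def pages_to_read(pages: List[Page], target: int) -> int:
    """Count pages whose [min, max] range may contain `target` (value = target)."""
    return sum(1 for p in pages if p.min_value <= target <= p.max_value)


values = list(range(10_000))
sorted_pages = build_pages(values, page_size=100)

# Same values in random order: every page spans almost the whole value range.
shuffled = values[:]
random.Random(0).shuffle(shuffled)
unsorted_pages = build_pages(shuffled, page_size=100)

sorted_hits = pages_to_read(sorted_pages, target=4242)
unsorted_hits = pages_to_read(unsorted_pages, target=4242)
print(f"sorted:   read {sorted_hits} of {len(sorted_pages)} pages")
print(f"unsorted: read {unsorted_hits} of {len(unsorted_pages)} pages")
```

With sorted input only the single page covering the target survives the 
min/max check. With shuffled input the ranges overlap so heavily that 
practically no page can be dropped, so the index lookup only adds cost.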

From the column index point of view, we should not see much difference between 
the runs if no PPD (predicate pushdown) is used (no filter is set in the 
Parquet API).

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 1.11.1
>
>
> Make Spark SQL support Column indexes.





[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-04-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081827#comment-17081827
 ] 

Yuming Wang commented on PARQUET-1739:
--

[~gszadovszky] I found that in some cases the performance is worse:
|Case|Parquet 1.11 Vectorized(ms)|Parquet 1.11 Vectorized(Pushdown)(ms)|Parquet 1.10 Vectorized(ms)|Parquet 1.10 Vectorized(Pushdown)(ms)|%Improved|
|Select 1 distinct string row (value <=> '100')|6309|1418|7113|528|1.68560606|
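
The comment does not say how the last column is computed. The reported value 
is consistent with the Parquet 1.11 pushdown time divided by the Parquet 1.10 
pushdown time, minus one, so a positive value would mean 1.11 is slower. This 
is an inference from the numbers, not something stated in the thread. A small 
Python sketch of that assumed formula (`relative_change` is a hypothetical 
helper name):

```python
def relative_change(new_ms: float, old_ms: float) -> float:
    """Fractional change of the newer timing relative to the older one.

    Assumed formula, inferred from the table: new / old - 1.
    Positive means the newer run took longer.
    """
    return new_ms / old_ms - 1.0


# Pushdown timings (ms) from the row above: Parquet 1.11 vs. Parquet 1.10.
change = relative_change(1418, 528)
print(round(change, 8))
```

Under this reading, the highlighted case takes roughly 2.7 times as long with 
Parquet 1.11 as with 1.10 when the filter is pushed down.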



[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-04-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081826#comment-17081826
 ] 

Yuming Wang commented on PARQUET-1739:
--

Spark benchmark result:
|Case|Parquet 1.11 Vectorized(ms)|Parquet 1.11 Vectorized(Pushdown)(ms)|Parquet 1.10 Vectorized(ms)|Parquet 1.10 Vectorized(Pushdown)(ms)|%Improved|
|Select 0 string row (value IS NULL)|7001|631|8459|569|0.10896309|
|Select 0 string row ('7864320' < value < '7864320')|8801|744|9596|470|0.58297872|
|Select 1 string row (value = '7864320')|6973|578|8415|456|0.26754386|
|Select 1 string row (value <=> '7864320')|7090|867|9681|663|0.30769231|
|Select 1 string row ('7864320' <= value <= '7864320')|7637|639|8257|442|0.44570136|
|Select all string rows (value IS NOT NULL)|14638|14926|15058|17091|-0.1266749|
|Select 0 int row (value IS NULL)|7233|532|8373|460|0.15652174|
|Select 0 int row (7864320 < value < 7864320)|6474|558|8176|620|-0.1|
|Select 1 int row (value = 7864320)|7284|554|7545|435|0.27356322|
|Select 1 int row (value <=> 7864320)|7109|724|8550|484|0.49586777|
|Select 1 int row (7864320 <= value <= 7864320)|6340|563|7648|440|0.27954545|
|Select 1 int row (7864319 < value < 7864321)|7134|620|7521|435|0.42528736|
|Select 10% int rows (value < 1572864)|7561|1986|8790|1988|-0.001006|
|Select 50% int rows (value < 7864320)|10425|7434|10445|7133|0.04219823|
|Select 90% int rows (value < 14155776)|12130|11745|12959|12574|-0.0659297|
|Select all int rows (value IS NOT NULL)|12662|12961|13640|13794|-0.0603886|
|Select all int rows (value > -1)|12568|12864|13547|13691|-0.0604046|
|Select all int rows (value != -1)|12574|12874|14617|14533|-0.114154|
|Select 0 distinct string row (value IS NULL)|5925|455|7013|371|0.22641509|
|Select 0 distinct string row ('100' < value < '100')|6037|445|7087|391|0.13810742|
|Select 1 distinct string row (value = '100')|6107|603|7169|524|0.15076336|
|Select 1 distinct string row (value <=> '100')|6309|1418|7113|528|1.68560606|
|Select 1 distinct string row ('100' <= value <= '100')|6224|620|7222|549|0.12932605|
|Select all distinct string rows (value IS NOT NULL)|14198|14293|15175|16194|-0.1173892|
|StringStartsWith filter: (value like '10%')|8399|3572|10298|2642|0.35200606|
|StringStartsWith filter: (value like '1000%')|7424|559|7998|441|0.2675737|
|StringStartsWith filter: (value like '786432%')|7554|542|7920|428|0.26635514|
|Select 1 decimal(9, 2) row (value = 7864320)|2684|131|3834|115|0.13913043|
|Select 10% decimal(9, 2) rows (value < 1572864)|4201|2280|5139|2170|0.05069124|
|Select 50% decimal(9, 2) rows (value < 7864320)|8661|8325|9593|10449|-0.203273|
|Select 90% decimal(9, 2) rows (value < 14155776)|10213|9833|11647|11828|-0.1686676|
|Select 1 decimal(18, 2) row (value = 7864320)|3259|150|4631|133|0.12781955|
|Select 10% decimal(18, 2) rows (value < 1572864)|4072|1284|5285|1260|0.01904762|
|Select 50% decimal(18, 2) rows (value < 7864320)|7010|5495|7959|5898|-0.0683282|
|Select 90% decimal(18, 2) rows (value < 14155776)|10037|9957|10845|10535|-0.0548647|
|Select 1 decimal(38, 2) row (value = 7864320)|4970|151|5943|131|0.15267176|
|Select 10% decimal(38, 2) rows (value < 1572864)|5912|1605|7079|1827|-0.1215107|
|Select 50% decimal(38, 2) rows (value < 7864320)|9784|7573|11497|7991|-0.0523088|
|Select 90% decimal(38, 2) rows (value < 14155776)|13935|13341|14702|14183|-0.0593668|
|InSet -> InFilters (values count: 5, distribution: 10)|7193|600|8001|495|0.21212121|
|InSet -> InFilters (values count: 5, distribution: 50)|7002|577|8042|480|0.20208333|
|InSet -> InFilters (values count: 5, distribution: 90)|7003|587|8526|484|0.21280992|
|InSet -> InFilters (values count: 10, distribution: 10)|6984|625|8279|519|0.20423892|
|InSet -> InFilters (values count: 10, distribution: 50)|6949|706|8097|505|0.3980198|
|InSet -> InFilters (values count: 10, distribution: 90)|7336|613|7961|507|0.20907298|
|InSet -> InFilters (values count: 50, distribution: 10)|7369|7475|8052|8244|-0.09328|
|InSet -> InFilters (values count: 50, distribution: 50)|7295|7619|8202|8311|-0.0832631|
|InSet -> InFilters (values count: 50, distribution: 90)|7584|7610|8405|8326|-0.0859957|
|InSet -> InFilters (values count: 100, distribution: 10)|7264|7358|8041|8200|-0.1026829|
|InSet -> InFilters (values count: 100, distribution: 50)|7192|7277|8019|8437|-0.1374896|
|InSet -> InFilters (values count: 100, distribution: 90)|7040|7236|10567|10681|-0.3225353|
|Select 1 tinyint row (value = CAST(63 AS tinyint))|3185|247|4855|235|0.05106383|
|Select 10% tinyint rows (value < CAST(12 AS tinyint))|3823|1120|5091|1209|-0.0736146|
|Select 50% tinyint rows (value < CAST(63 AS tinyint))|6570|5117|9265|6076|-0.1578341|
|Select 90% tinyint rows (value < CAST(114 AS tinyint))|9291|9229|10508|10152|-0.090918|
|Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp))|4054|4757|6253|4774|-0.003561|
|Select 10% timestamp stored