[jira] [Commented] (SPARK-30332) When running SQL query with limit, Catalyst throws StackOverflow exception

2020-01-18 Thread Rahul Kumar Challapalli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018828#comment-17018828
 ] 

Rahul Kumar Challapalli commented on SPARK-30332:
-

If you cannot narrow down the problem, would you be able to provide the dataset 
and logs?

> When running SQL query with limit, Catalyst throws StackOverflow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupCollateralPossession,

[jira] [Resolved] (SPARK-30551) Disable comparison for interval type

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30551.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27262
[https://github.com/apache/spark/pull/27262]

> Disable comparison for interval type
> 
>
> Key: SPARK-30551
> URL: https://issues.apache.org/jira/browse/SPARK-30551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> As we are not going to follow ANSI, it does not make sense to compare the 
> year-month part to the day-time part in our current implementation of interval. 
> Additionally, the current ordering logic comes from PostgreSQL, where the 
> implementation of the interval type is messy, and we are not aiming for 
> PostgreSQL compliance at all.
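
To make the ambiguity concrete, a minimal sketch (illustrative only, using Spark's interval literal syntax) of the kind of comparison this change disables:

{code:scala}
// Illustrative sketch: with comparison enabled, Spark would have to decide
// whether 1 month is greater than 30 days, which has no well-defined answer
// (a month spans 28-31 days). After SPARK-30551 such a query is rejected.
spark.sql("SELECT INTERVAL '1 month' > INTERVAL '30 days'").show()
{code}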



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30551) Disable comparison for interval type

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30551:
---

Assignee: Kent Yao

> Disable comparison for interval type
> 
>
> Key: SPARK-30551
> URL: https://issues.apache.org/jira/browse/SPARK-30551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> As we are not going to follow ANSI, it does not make sense to compare the 
> year-month part to the day-time part in our current implementation of interval. 
> Additionally, the current ordering logic comes from PostgreSQL, where the 
> implementation of the interval type is messy, and we are not aiming for 
> PostgreSQL compliance at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30568) Invalidate interval type as a table schema field

2020-01-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-30568:


 Summary: Invalidate interval type as a table schema field
 Key: SPARK-30568
 URL: https://issues.apache.org/jira/browse/SPARK-30568
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


After this commit 
https://github.com/apache/spark/commit/d67b98ea016e9b714bef68feaac108edd08159c9,
we are able to create or alter a table with interval column types if the 
external catalog accepts them, which deviates from the interval type's intended 
internal-only usage.
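
For concreteness, a hedged sketch of the kind of DDL this ticket proposes to reject up front (the table name and data source below are placeholders):

{code:scala}
// Sketch only: declaring an interval column in a user-facing table schema
// should fail analysis instead of being handed to the external catalog.
spark.sql("CREATE TABLE interval_tbl (i INTERVAL) USING parquet")
spark.sql("ALTER TABLE interval_tbl ADD COLUMNS (j INTERVAL)")
{code}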



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18455) General support for correlated subquery processing

2020-01-18 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18455:
---

Assignee: Dilip Biswal

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
>Assignee: Dilip Biswal
>Priority: Major
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support was introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use cases: the ones used in TPC queries, for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or EXISTS types of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However, this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregate and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.
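
To illustrate the first unsupported case above, a hedged sketch (tables t(a, b) and u(c, d) are hypothetical) of a correlated predicate subquery used in a projection rather than in a WHERE clause:

{code:scala}
// Hypothetical schema: t(a, b), u(c, d). The IN predicate appears in the
// SELECT list and is correlated through u.d = t.a; this is one of the patterns
// Spark currently rejects and this ticket aims to support.
spark.sql("""
  SELECT t.a,
         t.b IN (SELECT u.c FROM u WHERE u.d = t.a) AS b_seen_in_u
  FROM t
""")
{code}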



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29679) Make interval type comparable and orderable

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29679.
-
Resolution: Won't Do

> Make interval type comparable and orderable
> ---
>
> Key: SPARK-29679
> URL: https://issues.apache.org/jira/browse/SPARK-29679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 
> minutes' > interval '1 s';
>  ?column?
> --
>  t
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29679) Make interval type comparable and orderable

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-29679:
-

> Make interval type comparable and orderable
> ---
>
> Key: SPARK-29679
> URL: https://issues.apache.org/jira/browse/SPARK-29679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 
> minutes' > interval '1 s';
>  ?column?
> --
>  t
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29679) Make interval type comparable and orderable

2020-01-18 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018822#comment-17018822
 ] 

Wenchen Fan commented on SPARK-29679:
-

This has been reverted in https://github.com/apache/spark/pull/27262

> Make interval type comparable and orderable
> ---
>
> Key: SPARK-29679
> URL: https://issues.apache.org/jira/browse/SPARK-29679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 
> minutes' > interval '1 s';
>  ?column?
> --
>  t
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset

2020-01-18 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018820#comment-17018820
 ] 

Wenchen Fan commented on SPARK-30048:
-

This has been reverted in https://github.com/apache/spark/pull/27262

> Enable aggregates with interval type values for RelationalGroupedDataset 
> -
>
> Key: SPARK-30048
> URL: https://issues.apache.org/jira/browse/SPARK-30048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Now that min/max/sum/avg are supported for intervals, we should also enable 
> them in RelationalGroupedDataset
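
For context, a hedged sketch (the DataFrame and column names are placeholders) of the RelationalGroupedDataset usage this would enable, relying on the interval aggregate support that the comment above notes was later reverted:

{code:scala}
// Sketch only: given a DataFrame df with a grouping key "key" and an
// interval-typed column "dur", expose the same aggregates through the
// RelationalGroupedDataset API that already work in SQL.
import org.apache.spark.sql.functions.{avg, max, min, sum}

df.groupBy("key").agg(min("dur"), max("dur"), sum("dur"), avg("dur"))
{code}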



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30048.
-
Resolution: Won't Do

> Enable aggregates with interval type values for RelationalGroupedDataset 
> -
>
> Key: SPARK-30048
> URL: https://issues.apache.org/jira/browse/SPARK-30048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Now that min/max/sum/avg are supported for intervals, we should also enable 
> them in RelationalGroupedDataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset

2020-01-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-30048:
-

> Enable aggregates with interval type values for RelationalGroupedDataset 
> -
>
> Key: SPARK-30048
> URL: https://issues.apache.org/jira/browse/SPARK-30048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Now that min/max/sum/avg are supported for intervals, we should also enable 
> them in RelationalGroupedDataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28531) Improve Extract Python UDFs optimizer rule to enforce idempotence

2020-01-18 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018818#comment-17018818
 ] 

Xiao Li commented on SPARK-28531:
-

[~mauzhang] Feel free to submit a PR

> Improve Extract Python UDFs optimizer rule to enforce idempotence
> -
>
> Key: SPARK-28531
> URL: https://issues.apache.org/jira/browse/SPARK-28531
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-01-18 Thread yu jiantao (Jira)
yu jiantao created SPARK-30567:
--

 Summary: setDelegateCatalog should be called if catalog has 
implemented CatalogExtension
 Key: SPARK-30567
 URL: https://issues.apache.org/jira/browse/SPARK-30567
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: yu jiantao
 Fix For: 3.0.0


CatalogManager.catalog calls Catalogs.load to load a catalog if it is not 
'spark_catalog'. If the catalog implements CatalogExtension, setDelegateCatalog 
is not called when the catalog is loaded, which is inconsistent with what is 
done for v2SessionCatalog and is confusing for customized session catalogs such 
as Iceberg's SparkSessionCatalog.
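
A minimal sketch of the behavior the reporter appears to expect when a custom catalog is loaded; CatalogExtension.setDelegateCatalog and Catalogs.load are from the connector catalog API, while the surrounding variable names are assumptions:

{code:scala}
// Sketch only: after loading any catalog that implements CatalogExtension,
// hand it the built-in session catalog as delegate, mirroring what is already
// done for "spark_catalog". `name`, `conf`, and `v2SessionCatalog` are assumed
// to come from the surrounding CatalogManager context.
import org.apache.spark.sql.connector.catalog.{CatalogExtension, CatalogPlugin, Catalogs}

val loaded: CatalogPlugin = Catalogs.load(name, conf)
loaded match {
  case ext: CatalogExtension => ext.setDelegateCatalog(v2SessionCatalog)
  case _ => // plain catalogs need no delegate
}
loaded
{code}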



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30566) Iterator doesn't refer to outer identifier named "iterator" properly in Scala 2.13

2020-01-18 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-30566:
---
Parent: SPARK-25075
Issue Type: Sub-task  (was: Bug)

> Iterator doesn't refer to outer identifier named "iterator" properly in 
> Scala 2.13
> ---
>
> Key: SPARK-30566
> URL: https://issues.apache.org/jira/browse/SPARK-30566
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
> Environment: Scala 2.13
>Reporter: Kousuke Saruta
>Priority: Minor
>
> As of Scala 2.13, scala.collection.Iterator has an "iterator" method, so if an 
> inner class of Iterator intends to refer to an outer identifier named 
> "iterator", it does not resolve as expected.
> The following is an example.
> {code:scala}
> val iterator = ...  // some outer iterator
> new Iterator[Int] {
>   override def hasNext: Boolean = iterator.hasNext
>   override def next(): Int =
>     iterator.next() // this "iterator" is not what we defined above.
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30566) Iterator doesn't refer to outer identifier named "iterator" properly in Scala 2.13

2020-01-18 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-30566:
--

 Summary: Iterator doesn't refer to outer identifier named "iterator" 
properly in Scala 2.13
 Key: SPARK-30566
 URL: https://issues.apache.org/jira/browse/SPARK-30566
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
 Environment: Scala 2.13
Reporter: Kousuke Saruta


As of Scala 2.13, scala.collection.Iterator has an "iterator" method, so if an 
inner class of Iterator intends to refer to an outer identifier named "iterator", 
it does not resolve as expected.
The following is an example.

{code:scala}
val iterator = ...  // some outer iterator

new Iterator[Int] {
  override def hasNext: Boolean = iterator.hasNext
  override def next(): Int =
    iterator.next() // this "iterator" is not what we defined above.
}
{code}
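
A common workaround, sketched below, is to re-bind the outer value to a name that does not clash with the Iterator.iterator member introduced in 2.13:

{code:scala}
// Runnable sketch of the workaround: alias the outer iterator so the anonymous
// subclass cannot accidentally resolve the inherited Iterator.iterator member.
val iterator = Seq(1, 2, 3).iterator
val outer = iterator // non-clashing name

val wrapped = new Iterator[Int] {
  override def hasNext: Boolean = outer.hasNext
  override def next(): Int = outer.next()
}
{code}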



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30565) Regression in the ORC benchmark

2020-01-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30565:
--

 Summary: Regression in the ORC benchmark
 Key: SPARK-30565
 URL: https://issues.apache.org/jira/browse/SPARK-30565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


New benchmark results generated in the PR 
[https://github.com/apache/spark/pull/27078] show regression ~3 times.

Before:
{code}
Hive built-in ORC    520    531    8    2.0    495.8    0.6X
{code}
https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138
After:
{code}
Hive built-in ORC   1761   1792   43    0.6   1679.3    0.1X
{code}
https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30564) Regression in the wide schema benchmark

2020-01-18 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018653#comment-17018653
 ] 

Maxim Gekk commented on SPARK-30564:


[~viirya] Please take a look at this; it might be of interest to you.

> Regression in the wide schema benchmark
> ---
>
> Key: SPARK-30564
> URL: https://issues.apache.org/jira/browse/SPARK-30564
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> New results of WideSchemaBenchmark generated in the PR: 
> https://github.com/apache/spark/pull/27078 show regressions up to 2 times.
> Before:
> {code}
> 2500 select expressions    103 /  107    0.0   102962705.0    0.1X
> {code}
> https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11
> After:
> {code}
> 2500 select expressions    211    214    4    0.0   210927791.0    0.0X
> {code}
> https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30564) Regression in the wide schema benchmark

2020-01-18 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018652#comment-17018652
 ] 

Maxim Gekk commented on SPARK-30564:


Here, the regression is ~8 times: 
[https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R74]

> Regression in the wide schema benchmark
> ---
>
> Key: SPARK-30564
> URL: https://issues.apache.org/jira/browse/SPARK-30564
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> New results of WideSchemaBenchmark generated in the PR: 
> https://github.com/apache/spark/pull/27078 show regressions up to 2 times.
> Before:
> {code}
> 2500 select expressions    103 /  107    0.0   102962705.0    0.1X
> {code}
> https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11
> After:
> {code}
> 2500 select expressions    211    214    4    0.0   210927791.0    0.0X
> {code}
> https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30564) Regression in the wide schema benchmark

2020-01-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30564:
--

 Summary: Regression in the wide schema benchmark
 Key: SPARK-30564
 URL: https://issues.apache.org/jira/browse/SPARK-30564
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


New results of WideSchemaBenchmark generated in the PR: 
https://github.com/apache/spark/pull/27078 show regressions up to 2 times.
Before:
{code}
2500 select expressions    103 /  107    0.0   102962705.0    0.1X
{code}
https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11
After:
{code}
2500 select expressions    211    214    4    0.0   210927791.0    0.0X
{code}
https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30563) Regressions in Join benchmarks

2020-01-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30563:
--

 Summary: Regressions in Join benchmarks
 Key: SPARK-30563
 URL: https://issues.apache.org/jira/browse/SPARK-30563
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Regenerated benchmark results in https://github.com/apache/spark/pull/27078 
show many regressions in JoinBenchmark. The benchmarked queries slowed down by 
up to 3 times; see
old results:
https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
new results:
https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10

One difference is that the new queries use the `NoOp` datasource.
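
For reference, a minimal sketch (the join itself is a placeholder) of how a benchmark query is forced to execute through the noop data source, which materializes the result without writing it anywhere:

{code:scala}
// Sketch: the "noop" data source (new in Spark 3.0) consumes the query output
// and discards it, so the measurement covers query execution only.
val joined = left.join(right, "key") // placeholder benchmark query
joined.write.format("noop").mode("overwrite").save()
{code}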



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30562) Regression in interval string parsing

2020-01-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30562:
--

 Summary: Regression in interval string parsing
 Key: SPARK-30562
 URL: https://issues.apache.org/jira/browse/SPARK-30562
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Previously:
  11 units w/o interval - 1972.8 ns per row
Regenerated results in the PR https://github.com/apache/spark/pull/27078:
  11 units w/o interval - 3272.6 ns per row

The regression is about 66%; see 
https://github.com/apache/spark/pull/27078/files#diff-586487fac2b9b1303aaf80adf8fa37abR28




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30561) start spark applications without a 30-second startup penalty

2020-01-18 Thread t oo (Jira)
t oo created SPARK-30561:


 Summary: start spark applications without a 30-second startup 
penalty
 Key: SPARK-30561
 URL: https://issues.apache.org/jira/browse/SPARK-30561
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: t oo


see 
https://stackoverflow.com/questions/57610138/how-to-start-spark-applications-without-a-30second-startup-penalty

using spark standalone.

There are several sleeps that can be removed:

grep -i 'sleep(' -R * | grep -v 'src/test/' | grep -E '^core' | grep -ivE 'mesos|yarn|python|HistoryServer|spark/ui/'
core/src/main/scala/org/apache/spark/util/Clock.scala:  Thread.sleep(sleepTime)
core/src/main/scala/org/apache/spark/SparkContext.scala:   * sc.parallelize(1 to 1, 2).map { i => Thread.sleep(10); i }.count()
core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:  private def delay(secs: Duration = 5.seconds) = Thread.sleep(secs.toMillis)
core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:  Thread.sleep(1000)
core/src/main/scala/org/apache/spark/deploy/Client.scala:Thread.sleep(5000)
core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala:  Thread.sleep(100)
core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala:  Thread.sleep(duration)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:def sleep(seconds: Int): Unit = (0 until seconds).takeWhile { _ =>
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:  Thread.sleep(1000)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:sleeper.sleep(waitSeconds)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:  def sleep(seconds: Int): Unit
core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala:  Thread.sleep(REPORT_DRIVER_STATUS_INTERVAL)
core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala:  Thread.sleep(10)
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:  Thread.sleep(SLEEP_TIME_SECS * 1000L)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30560) allow driver to consume a fractional core

2020-01-18 Thread t oo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-30560:
-
Description: 
see 
https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste

this is to make it possible for a driver to use 0.2 cores rather than a whole 
core

Standard CPUs, no GPUs

  was:
see 
https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste

this is to make it possible for a driver to use 0.2 cores rather than a whole 
core


> allow driver to consume a fractional core
> -
>
> Key: SPARK-30560
> URL: https://issues.apache.org/jira/browse/SPARK-30560
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.4
>Reporter: t oo
>Priority: Minor
>
> see 
> https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste
> this is to make it possible for a driver to use 0.2 cores rather than a whole 
> core
> Standard CPUs, no GPUs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30560) allow driver to consume a fractional core

2020-01-18 Thread t oo (Jira)
t oo created SPARK-30560:


 Summary: allow driver to consume a fractional core
 Key: SPARK-30560
 URL: https://issues.apache.org/jira/browse/SPARK-30560
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.4
Reporter: t oo


see 
https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste

this is to make it possible for a driver to use 0.2 cores rather than a whole 
core



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-01-18 Thread Ori Popowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ori Popowski updated SPARK-30559:
-
Description: 
In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time {{spark.sql("SELECT …")}} is 
called (the first time it writes the schema to TBLPROPERTIES in the metastore 
and subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file
{code:scala}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:sh}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
h4. 3) Create a Hive table 
{code:sql}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug
h4. 3) Read the table in Spark using INFER_ONLY
{code:sh}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run for the first time
{code:sh}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber{code}
We see that column names are lowercase
h4. 2) Run for the second time
{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
theString
theNumber{code}
We see that the column names are camelCase
h4. Conclusion

When INFER_AND_SAVE is set, column names are lowercase on the first call and 
camelCase on subsequent calls.

 

 

  was:
In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file
{code:scala}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:sh}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
h4. 3) Create a Hive table 
{code:sql}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug
h4. 3) Read the table in Spark using INFER_ONLY
{code:sh}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:sh}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run the 

[jira] [Updated] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-01-18 Thread Ori Popowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ori Popowski updated SPARK-30559:
-
Description: 
In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file
{code:scala}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:sh}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
h4. 3) Create a Hive table 
{code:sql}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug
h4. 3) Read the table in Spark using INFER_ONLY
{code:sh}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:sh}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run for the first time
{code:sh}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber{code}
We see that column names are lowercase
h4. 2) Run for the second time
{code:scala}
scala> spark.sql("select * from default.t").columns.foreach(println)
theString
theNumber{code}
We see that the column names are camelCase
h4. Conclusion

When INFER_AND_SAVE is set, column names are lowercase on the first call and 
camelCase on subsequent calls.

 

 

  was:
In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file
{code:java}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:java}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
h4. 3) Create a Hive table 
{code:java}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug
h4. 3) Read the table in Spark using INFER_ONLY
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:java}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run the 

[jira] [Updated] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-01-18 Thread Ori Popowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ori Popowski updated SPARK-30559:
-
Description: 
In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file
{code:java}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:java}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
h4. 3) Create a Hive table 
{code:java}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug
h4. 3) Read the table in Spark using INFER_ONLY
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:java}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run for the first time
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
thestring
thenumber{code}
We see that column names are lowercase
h4. 2) Run for the second time
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
theString
theNumber{code}
We see that the column names are camelCase
h4. Conclusion

When INFER_AND_SAVE is set, column names are lowercase on the first call and 
camelCase on subsequent calls.

 

 

  was:
Spark SQL's spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file

 
{code:java}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files

 
{code:java}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
 

We see that they are saved with camelCase column names.
h4. 3) Create a Hive table

 
{code:java}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug

 
h4. 3) Read the table in Spark using INFER_ONLY

 
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
 

 
{code:java}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce 

[jira] [Created] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-01-18 Thread Ori Popowski (Jira)
Ori Popowski created SPARK-30559:


 Summary: Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode 
does not work with Hive
 Key: SPARK-30559
 URL: https://issues.apache.org/jira/browse/SPARK-30559
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
Reporter: Ori Popowski


Spark SQL's spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY and 
INFER_AND_SAVE do not work as intended. They were supposed to infer a 
case-sensitive schema from the underlying files, but they do not work.
 # INFER_ONLY never works: it will always use lowercase column names from the 
Hive metastore schema
 # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called 
(the first time it writes the schema to TBLPROPERTIES in the metastore and 
subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
metastore, and read it from the metastore on any subsequent calls
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet file

 
{code:java}
scala> List(("a", 1), ("b", 2)).toDF("theString", 
"theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files

 
{code:java}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-0-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j 
hdfs:///t/part-1-….snappy.parquet
{"theString":"b","theNumber":2}{code}
 

We see that they are saved with camelCase column names.
h4. 3) Create a Hive table

 
{code:java}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 
 > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 
 > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
 
h3. Reproduce INFER_ONLY bug

 
h4. 3) Read the table in Spark using INFER_ONLY

 
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
 

 
{code:java}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)

thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, column names are lowercase always.
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run for the first time
{code:java}
$ spark-shell --master local[*] --conf 
spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
 

 
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
thestring
thenumber{code}
 

We see that column names are lowercase
h4. 2) Run for the second time

 
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
theString
theNumber{code}
 

We see that the column names are camelCase
h4. Conclusion

When INFER_AND_SAVE is set, column names are lowercase on the first call and 
camelCase on subsequent calls.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30558) Avoid rebuilding `AvroOptions` per each partition

2020-01-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30558:
--

 Summary: Avoid rebuilding `AvroOptions` per each partition
 Key: SPARK-30558
 URL: https://issues.apache.org/jira/browse/SPARK-30558
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, an instance of `AvroOptions` is created for each partition. This can 
be avoided by building it only once and passing it to `AvroScan`. See 
https://github.com/apache/spark/pull/27174#discussion_r365596481 for more 
details.
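
A hedged sketch of the proposed pattern; `AvroOptions`'s constructor arguments and the surrounding plumbing (`options`, `hadoopConf`, `partitions`, `buildReader`) are assumptions about the internal code, not its actual API:

{code:scala}
// Sketch only: parse the options map once on the driver and reuse the result,
// instead of constructing a fresh AvroOptions inside every partition reader.
val parsedOptions = new AvroOptions(options, hadoopConf) // built once

partitions.map { partition =>
  buildReader(partition, parsedOptions) // reused for every partition
}
{code}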



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-18 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018544#comment-17018544
 ] 

t oo commented on SPARK-27750:
--

Did some digging; let me know if I'm on the right track.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L785
 ---> why the precedence comment? I want the opposite.

In {{private def schedule(): Unit}}, add:
rawFreeCores = shuffledAliveWorkers.map(_.coresFree).sum
cores_reserved_for_apps = *get from config somehow* ie 8
forDriversFreeCores = math.max(rawFreeCores - cores_reserved_for_apps, 0)
then wrap the inner steps in:
if (forDriversFreeCores >= driver.desc.cores) {
  canLaunchDriver
}
(a fuller sketch follows after the links below)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L796

#just note about driver/exec requests
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L746-L775
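
A hedged, self-contained Scala sketch of the policy the comment proposes; the helper name and the reserved-cores config key are assumptions, not Spark's actual scheduler code:

{code:scala}
// Sketch: given the free cores on all alive workers and a configured number of
// cores reserved for applications, decide whether a waiting driver may launch.
def canLaunchDriverNow(
    workerFreeCores: Seq[Int], // coresFree of each alive worker
    coresReservedForApps: Int, // hypothetical config, e.g. spark.deploy.coresReservedForApps
    driverCores: Int): Boolean = {
  val rawFreeCores = workerFreeCores.sum
  val freeForDrivers = math.max(rawFreeCores - coresReservedForApps, 0)
  freeForDrivers >= driverCores
}

// 3 workers with 4 free cores each; reserving 11 cores for apps leaves only 1,
// so a 2-core driver must wait instead of starving applications.
canLaunchDriverNow(Seq(4, 4, 4), coresReservedForApps = 11, driverCores = 2) // false
{code}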



> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30539) DataFrame.tail in PySpark API

2020-01-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30539:
-

Assignee: Hyukjin Kwon

> DataFrame.tail in PySpark API
> -
>
> Key: SPARK-30539
> URL: https://issues.apache.org/jira/browse/SPARK-30539
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-30185 added the DataFrame.tail API. It would be good for the PySpark 
> side to have it too.
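
For reference, a minimal Scala illustration of the Dataset.tail API added by SPARK-30185 that this ticket mirrors on the PySpark side (the DataFrame is illustrative):

{code:scala}
// tail(n) collects the last n rows to the driver, as the counterpart of head(n).
val df = spark.range(10).toDF("id")
df.tail(3) // Array([7], [8], [9])
{code}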



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30539) DataFrame.tail in PySpark API

2020-01-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30539.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27251
[https://github.com/apache/spark/pull/27251]

> DataFrame.tail in PySpark API
> -
>
> Key: SPARK-30539
> URL: https://issues.apache.org/jira/browse/SPARK-30539
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-30185 added the DataFrame.tail API. It would be good for the PySpark 
> side to have it too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30544) Upgrade Genjavadoc to 0.15

2020-01-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30544.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27255
[https://github.com/apache/spark/pull/27255]

> Upgrade Genjavadoc to 0.15
> --
>
> Key: SPARK-30544
> URL: https://issues.apache.org/jira/browse/SPARK-30544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> Genjavadoc 0.14 doesn't support Scala 2.13, so sbt -Pscala-2.13 will fail to 
> build. Let's upgrade it to 0.15.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org