[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018828#comment-17018828 ] Rahul Kumar Challapalli commented on SPARK-30332: - If you cannot narrow down the problem, would you be able to provide the dataset and logs? > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview >Reporter: Izek Greenfield >Priority: Major > > Running that SQL: > {code:sql} > SELECT BT_capital.asof_date, > BT_capital.run_id, > BT_capital.v, > BT_capital.id, > BT_capital.entity, > BT_capital.level_1, > BT_capital.level_2, > BT_capital.level_3, > BT_capital.level_4, > BT_capital.level_5, > BT_capital.level_6, > BT_capital.path_bt_capital, > BT_capital.line_item, > t0.target_line_item, > t0.line_description, > BT_capital.col_item, > BT_capital.rep_amount, > root.orgUnitId, > root.cptyId, > root.instId, > root.startDate, > root.maturityDate, > root.amount, > root.nominalAmount, > root.quantity, > root.lkupAssetLiability, > root.lkupCurrency, > root.lkupProdType, > root.interestResetDate, > root.interestResetTerm, > root.noticePeriod, > root.historicCostAmount, > root.dueDate, > root.lkupResidence, > root.lkupCountryOfUltimateRisk, > root.lkupSector, > root.lkupIndustry, > root.lkupAccountingPortfolioType, > root.lkupLoanDepositTerm, > root.lkupFixedFloating, > root.lkupCollateralType, > root.lkupRiskType, > root.lkupEligibleRefinancing, > root.lkupHedging, > root.lkupIsOwnIssued, > root.lkupIsSubordinated, > root.lkupIsQuoted, > root.lkupIsSecuritised, > root.lkupIsSecuritisedServiced, > root.lkupIsSyndicated, > root.lkupIsDeRecognised, > root.lkupIsRenegotiated, > root.lkupIsTransferable, > root.lkupIsNewBusiness, > root.lkupIsFiduciary, > root.lkupIsNonPerforming, > root.lkupIsInterGroup, > 
root.lkupIsIntraGroup, > root.lkupIsRediscounted, > root.lkupIsCollateral, > root.lkupIsExercised, > root.lkupIsImpaired, > root.facilityId, > root.lkupIsOTC, > root.lkupIsDefaulted, > root.lkupIsSavingsPosition, > root.lkupIsForborne, > root.lkupIsDebtRestructuringLoan, > root.interestRateAAR, > root.interestRateAPRC, > root.custom1, > root.custom2, > root.custom3, > root.lkupSecuritisationType, > root.lkupIsCashPooling, > root.lkupIsEquityParticipationGTE10, > root.lkupIsConvertible, > root.lkupEconomicHedge, > root.lkupIsNonCurrHeldForSale, > root.lkupIsEmbeddedDerivative, > root.lkupLoanPurpose, > root.lkupRegulated, > root.lkupRepaymentType, > root.glAccount, > root.lkupIsRecourse, > root.lkupIsNotFullyGuaranteed, > root.lkupImpairmentStage, > root.lkupIsEntireAmountWrittenOff, > root.lkupIsLowCreditRisk, > root.lkupIsOBSWithinIFRS9, > root.lkupIsUnderSpecialSurveillance, > root.lkupProtection, > root.lkupIsGeneralAllowance, > root.lkupSectorUltimateRisk, > root.cptyOrgUnitId, > root.name, > root.lkupNationality, > root.lkupSize, > root.lkupIsSPV, > root.lkupIsCentralCounterparty, > root.lkupIsMMRMFI, > root.lkupIsKeyManagement, > root.lkupIsOtherRelatedParty, > root.lkupResidenceProvince, > root.lkupIsTradingBook, > root.entityHierarchy_entityId, > root.entityHierarchy_Residence, > root.lkupLocalCurrency, > root.cpty_entityhierarchy_entityId, > root.lkupRelationship, > root.cpty_lkupRelationship, > root.entityNationality, > root.lkupRepCurrency, > root.startDateFinancialYear, > root.numEmployees, > root.numEmployeesTotal, > root.collateralAmount, > root.guaranteeAmount, > root.impairmentSpecificIndividual, > root.impairmentSpecificCollective, > root.impairmentGeneral, > root.creditRiskAmount, > root.provisionSpecificIndividual, > root.provisionSpecificCollective, > root.provisionGeneral, > root.writeOffAmount, > root.interest, > root.fairValueAmount, > root.grossCarryingAmount, > root.carryingAmount, > root.code, > root.lkupInstrumentType, > root.price, > 
root.amountAtIssue, > root.yield, > root.totalFacilityAmount, > root.facility_rate, > root.spec_indiv_est, > root.spec_coll_est, > root.coll_inc_loss, > root.impairment_amount, > root.provision_amount, > root.accumulated_impairment, > root.exclusionFlag, > root.lkupIsHoldingCompany, > root.instrument_startDate, > root.entityResidence, > fxRate.enumerator, > fxRate.lkupFromCurrency, > fxRate.rate, > fxRate.custom1, > fxRate.custom2, > fxRate.custom3, > GB_position.lkupIsECGDGuaranteed, > GB_position.lkupIsMultiAcctOffsetMortgage, > GB_position.lkupIsIndexLinked, > GB_position.lkupIsRetail, > GB_position.lkupCollateralLocation, > GB_position.percentAboveBBR, > GB_position.lkupIsMoreInArrears, > GB_position.lkupIsArrearsCapitalised, > GB_position.lkupCollateralPossession,
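The StackOverflow reported above typically comes from recursive tree walks over very deep query plans, such as the one produced by this enormous select list. A toy illustration in plain Python (not Catalyst itself — the structure and function names here are hypothetical):

```python
# Toy illustration (not Catalyst): a recursive walk over a very deep
# expression tree exhausts the call stack, while an explicit loop does not.
def make_chain(depth):
    """Build a deeply nested unary-operator tree as nested tuples."""
    node = ("leaf",)
    for _ in range(depth):
        node = ("op", node)
    return node

def count_recursive(node):
    # One stack frame per tree level: fails on deep trees.
    if node[0] == "leaf":
        return 1
    return 1 + count_recursive(node[1])

def count_iterative(node):
    # Constant Python stack depth regardless of tree height.
    count = 0
    while node[0] != "leaf":
        count += 1
        node = node[1]
    return count + 1

deep = make_chain(100_000)
print(count_iterative(deep))  # 100001
try:
    count_recursive(deep)
except RecursionError:
    print("recursive walk overflowed")
```

The JVM analogue is the `StackOverflowError` thrown by Catalyst's recursive `transform` calls on plans this deep.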
[jira] [Resolved] (SPARK-30551) Disable comparison for interval type
[ https://issues.apache.org/jira/browse/SPARK-30551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30551. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27262 [https://github.com/apache/spark/pull/27262] > Disable comparison for interval type > > > Key: SPARK-30551 > URL: https://issues.apache.org/jira/browse/SPARK-30551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > As we are not going to follow ANSI, it is weird to compare the year-month > part to the day-time part in our current implementation of interval. > Additionally, the current ordering logic comes from PostgreSQL, where the > implementation of the interval is messy, and we are not aiming for PostgreSQL > compliance at all. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
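The year-month vs day-time ambiguity the ticket describes can be shown with a toy sketch (plain Python, not Spark's `CalendarInterval`): an interval of "1 month" has no fixed length in days, so its ordering against "30 days" depends on which month you assume.

```python
# Toy illustration: ordering a (months, days) interval is ill-defined because
# the months component has no fixed day length.
def interval_days(months, days, days_per_month):
    # Collapse a (months, days) interval to days under an assumed month length.
    return months * days_per_month + days

one_month = (1, 0)
thirty_days = (0, 30)

# Assuming February (28 days): 1 month < 30 days.
assert interval_days(*one_month, 28) < interval_days(*thirty_days, 28)
# Assuming January (31 days): 1 month > 30 days.
assert interval_days(*one_month, 31) > interval_days(*thirty_days, 31)
print("ordering flips with the assumed month length")
```

Any total order Spark picked would therefore bake in an arbitrary month length, which is one reason comparison was disabled rather than kept.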
[jira] [Assigned] (SPARK-30551) Disable comparison for interval type
[ https://issues.apache.org/jira/browse/SPARK-30551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30551: --- Assignee: Kent Yao > Disable comparison for interval type > > > Key: SPARK-30551 > URL: https://issues.apache.org/jira/browse/SPARK-30551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > As we are not going to follow ANSI, it is weird to compare the year-month > part to the day-time part in our current implementation of interval. > Additionally, the current ordering logic comes from PostgreSQL, where the > implementation of the interval is messy, and we are not aiming for PostgreSQL > compliance at all.
[jira] [Created] (SPARK-30568) Invalidate interval type as a field table schema
Kent Yao created SPARK-30568: Summary: Invalidate interval type as a field table schema Key: SPARK-30568 URL: https://issues.apache.org/jira/browse/SPARK-30568 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao After this commit https://github.com/apache/spark/commit/d67b98ea016e9b714bef68feaac108edd08159c9, we are able to create or alter a table with interval column types whenever the external catalog accepts them, which diverges from the interval type's intended internal-only usage.
[jira] [Assigned] (SPARK-18455) General support for correlated subquery processing
[ https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-18455: --- Assignee: Dilip Biswal > General support for correlated subquery processing > -- > > Key: SPARK-18455 > URL: https://issues.apache.org/jira/browse/SPARK-18455 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Nattavut Sutyanyong >Assignee: Dilip Biswal >Priority: Major > Attachments: SPARK-18455-scoping-doc.pdf > > > Subquery support has been introduced in Spark 2.0. The initial implementation > covers the most common subquery use cases: the ones used in TPC queries, for > instance. > Spark currently supports the following subqueries: > * Uncorrelated Scalar Subqueries. All cases are supported. > * Correlated Scalar Subqueries. We only allow subqueries that are aggregated > and use equality predicates. > * Predicate Subqueries. IN or EXISTS type of queries. We allow most > predicates, except when they are pulled from under an Aggregate or Window > operator. In that case we only support equality predicates. > However, this does not cover the full range of possible subqueries. This, in > part, has to do with the fact that we currently rewrite all correlated > subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join. > We currently lack support for the following use cases: > * The use of predicate subqueries in a projection. > * The use of non-equality predicates below Aggregate and/or Window operators. > * The use of non-Aggregate subqueries for correlated scalar subqueries. > This JIRA aims to lift these current limitations in subquery processing.
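The rewrite strategy the ticket mentions — turning a correlated, aggregated scalar subquery with an equality predicate into a group-by plus LEFT join — can be sketched in plain Python over hypothetical toy data (not Spark code):

```python
# Toy sketch of the correlated-subquery-to-join rewrite. Query shape:
#   SELECT o.id, (SELECT SUM(amount) FROM payments p WHERE p.oid = o.id) FROM orders o
orders = [{"id": 1}, {"id": 2}, {"id": 3}]
payments = [{"oid": 1, "amount": 10}, {"oid": 1, "amount": 5}, {"oid": 2, "amount": 7}]

# Step 1: aggregate the subquery once, grouped by the correlation key.
sums = {}
for p in payments:
    sums[p["oid"]] = sums.get(p["oid"], 0) + p["amount"]

# Step 2: LEFT-join the aggregate back; missing groups yield NULL (None).
result = [{"id": o["id"], "total": sums.get(o["id"])} for o in orders]
print(result)  # [{'id': 1, 'total': 15}, {'id': 2, 'total': 7}, {'id': 3, 'total': None}]
```

The equality predicate is what makes step 1 possible: it becomes the grouping/join key. This is also why non-equality correlation predicates are the hard, unsupported case the ticket aims to lift.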
[jira] [Resolved] (SPARK-29679) Make interval type camparable and orderable
[ https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29679. - Resolution: Won't Do > Make interval type camparable and orderable > --- > > Key: SPARK-29679 > URL: https://issues.apache.org/jira/browse/SPARK-29679 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 > minutes' > interval '1 s'; > ?column? > -- > t > (1 row) > {code}
[jira] [Reopened] (SPARK-29679) Make interval type camparable and orderable
[ https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-29679: - > Make interval type camparable and orderable > --- > > Key: SPARK-29679 > URL: https://issues.apache.org/jira/browse/SPARK-29679 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 > minutes' > interval '1 s'; > ?column? > -- > t > (1 row) > {code}
[jira] [Commented] (SPARK-29679) Make interval type camparable and orderable
[ https://issues.apache.org/jira/browse/SPARK-29679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018822#comment-17018822 ] Wenchen Fan commented on SPARK-29679: - This has been reverted in https://github.com/apache/spark/pull/27262 > Make interval type camparable and orderable > --- > > Key: SPARK-29679 > URL: https://issues.apache.org/jira/browse/SPARK-29679 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > postgres=# select INTERVAL '9 years 1 months -1 weeks -4 days -10 hours -46 > minutes' > interval '1 s'; > ?column? > -- > t > (1 row) > {code}
[jira] [Commented] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018820#comment-17018820 ] Wenchen Fan commented on SPARK-30048: - This has been reverted in https://github.com/apache/spark/pull/27262 > Enable aggregates with interval type values for RelationalGroupedDataset > - > > Key: SPARK-30048 > URL: https://issues.apache.org/jira/browse/SPARK-30048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Now that min/max/sum/avg are supported for intervals, we should also enable > them in RelationalGroupedDataset
[jira] [Resolved] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30048. - Resolution: Won't Do > Enable aggregates with interval type values for RelationalGroupedDataset > - > > Key: SPARK-30048 > URL: https://issues.apache.org/jira/browse/SPARK-30048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Now that min/max/sum/avg are supported for intervals, we should also enable > them in RelationalGroupedDataset
[jira] [Reopened] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-30048: - > Enable aggregates with interval type values for RelationalGroupedDataset > - > > Key: SPARK-30048 > URL: https://issues.apache.org/jira/browse/SPARK-30048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Now that min/max/sum/avg are supported for intervals, we should also enable > them in RelationalGroupedDataset
[jira] [Commented] (SPARK-28531) Improve Extract Python UDFs optimizer rule to enforce idempotence
[ https://issues.apache.org/jira/browse/SPARK-28531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018818#comment-17018818 ] Xiao Li commented on SPARK-28531: - [~mauzhang] Feel free to submit a PR > Improve Extract Python UDFs optimizer rule to enforce idempotence > - > > Key: SPARK-28531 > URL: https://issues.apache.org/jira/browse/SPARK-28531 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major >
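"Enforce idempotence" here means that applying the rule a second time must not change the plan again, i.e. R(R(plan)) == R(plan). A toy sketch of what that property looks like (plain Python with a made-up rewrite, not Spark's actual ExtractPythonUDFs rule):

```python
# Toy illustration of optimizer-rule idempotence: a rule R is idempotent
# when R(R(plan)) == R(plan), so re-running the optimizer batch is a no-op.
def flatten_double_neg(plan):
    # Rewrite ("neg", ("neg", x)) -> x, recursively, on a tuple-based plan tree.
    if isinstance(plan, tuple) and plan[0] == "neg" and \
       isinstance(plan[1], tuple) and plan[1][0] == "neg":
        return flatten_double_neg(plan[1][1])
    if isinstance(plan, tuple) and plan[0] == "neg":
        return ("neg", flatten_double_neg(plan[1]))
    return plan

plan = ("neg", ("neg", ("neg", "x")))
once = flatten_double_neg(plan)
assert once == ("neg", "x")
# Idempotence check: a second application changes nothing.
assert flatten_double_neg(once) == once
print("rule is idempotent on this plan")
```

Spark's `Idempotent`-batch checker enforces exactly this kind of fixed-point property on rules that are expected to run once.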
[jira] [Created] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension
yu jiantao created SPARK-30567: -- Summary: setDelegateCatalog should be called if catalog has implemented CatalogExtension Key: SPARK-30567 URL: https://issues.apache.org/jira/browse/SPARK-30567 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: yu jiantao Fix For: 3.0.0 CatalogManager.catalog calls Catalogs.load to load a catalog if it is not 'spark_catalog'. If the catalog has implemented CatalogExtension, setDelegateCatalog is not called when the catalog is loaded, unlike what is done for v2SessionCatalog, and that is confusing for customized session catalogs such as Iceberg's SparkSessionCatalog.
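The delegation pattern behind the ticket can be sketched in plain Python (hypothetical names, not Spark's actual API): an extension catalog forwards requests it cannot answer to a delegate, so if the loader never wires the delegate in, every forwarded lookup breaks.

```python
# Toy sketch of the CatalogExtension delegation pattern. If the plugin loader
# forgets to call set_delegate_catalog(), forwarded lookups fail.
class SessionCatalog:
    def load_table(self, name):
        return f"builtin:{name}"

class ExtensionCatalog:
    def __init__(self):
        self._delegate = None

    def set_delegate_catalog(self, delegate):
        self._delegate = delegate

    def load_table(self, name):
        # Forward everything to the delegate session catalog.
        if self._delegate is None:
            raise RuntimeError("delegate catalog was never set")
        return self._delegate.load_table(name)

ext = ExtensionCatalog()
try:
    ext.load_table("t")          # fails: the delegate was never wired in
except RuntimeError as e:
    print(e)
ext.set_delegate_catalog(SessionCatalog())
print(ext.load_table("t"))       # builtin:t
```

The ticket's point is that Spark only performs the equivalent of the `set_delegate_catalog` call for its own v2SessionCatalog, not for user-supplied catalogs loaded by name.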
[jira] [Updated] (SPARK-30566) Iterator doesn't refer outer identifier named "iterator" properly in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-30566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-30566: --- Parent: SPARK-25075 Issue Type: Sub-task (was: Bug) > Iterator doesn't refer outer identifier named "iterator" properly in Scala > 2.13 > --- > > Key: SPARK-30566 > URL: https://issues.apache.org/jira/browse/SPARK-30566 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 > Environment: Scala 2.13 >Reporter: Kousuke Saruta >Priority: Minor > > As of Scala 2.13, scala.collection.Iterator has an "iterator" method, so if an > inner class of Iterator refers to an outer identifier named "iterator", > it does not resolve to what we expect. > Following is an example. > {code} > val iterator = ... > return new Iterator { > def next() { > iterator.next() // this "iterator" is not what we defined above. > } > } > {code}
[jira] [Created] (SPARK-30566) Iterator doesn't refer outer identifier named "iterator" properly in Scala 2.13
Kousuke Saruta created SPARK-30566: -- Summary: Iterator doesn't refer outer identifier named "iterator" properly in Scala 2.13 Key: SPARK-30566 URL: https://issues.apache.org/jira/browse/SPARK-30566 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Environment: Scala 2.13 Reporter: Kousuke Saruta As of Scala 2.13, scala.collection.Iterator has an "iterator" method, so if an inner class of Iterator refers to an outer identifier named "iterator", it does not resolve to what we expect. Following is an example. {code} val iterator = ... return new Iterator { def next() { iterator.next() // this "iterator" is not what we defined above. } } {code}
[jira] [Created] (SPARK-30565) Regression in the ORC benchmark
Maxim Gekk created SPARK-30565: -- Summary: Regression in the ORC benchmark Key: SPARK-30565 URL: https://issues.apache.org/jira/browse/SPARK-30565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk New benchmark results generated in the PR [https://github.com/apache/spark/pull/27078] show a regression of ~3x. Before: {code} Hive built-in ORC 520 531 8 2.0 495.8 0.6X {code} https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 After: {code} Hive built-in ORC 1761 1792 43 0.6 1679.3 0.1X {code} https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138
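The per-row times quoted above make the regression factor easy to sanity-check with simple arithmetic:

```python
# Sanity-check the slowdown implied by the per-row times quoted in the report.
before_ns = 495.8   # Hive built-in ORC, ns per row, old results
after_ns = 1679.3   # same benchmark line, new results
factor = after_ns / before_ns
print(f"slowdown: {factor:.1f}x")  # slowdown: 3.4x
```

So the "~3 times" characterization is consistent with the raw numbers.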
[jira] [Commented] (SPARK-30564) Regression in the wide schema benchmark
[ https://issues.apache.org/jira/browse/SPARK-30564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018653#comment-17018653 ] Maxim Gekk commented on SPARK-30564: [~viirya] Please, take a look at this. Maybe it can be interesting for you. > Regression in the wide schema benchmark > --- > > Key: SPARK-30564 > URL: https://issues.apache.org/jira/browse/SPARK-30564 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > New results of WideSchemaBenchmark generated in the PR: > https://github.com/apache/spark/pull/27078 show regressions up to 2 times. > Before: > {code} > 2500 select expressions 103 / 107 0.0 102962705.0 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11 > After: > {code} > 2500 select expressions 211 214 4 0.0 210927791.0 0.0X > {code} > https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11
[jira] [Commented] (SPARK-30564) Regression in the wide schema benchmark
[ https://issues.apache.org/jira/browse/SPARK-30564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018652#comment-17018652 ] Maxim Gekk commented on SPARK-30564: Here, regression is ~8 times [https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R74] > Regression in the wide schema benchmark > --- > > Key: SPARK-30564 > URL: https://issues.apache.org/jira/browse/SPARK-30564 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > New results of WideSchemaBenchmark generated in the PR: > https://github.com/apache/spark/pull/27078 show regressions up to 2 times. > Before: > {code} > 2500 select expressions 103 / 107 0.0 102962705.0 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11 > After: > {code} > 2500 select expressions 211 214 4 0.0 210927791.0 0.0X > {code} > https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11
[jira] [Created] (SPARK-30564) Regression in the wide schema benchmark
Maxim Gekk created SPARK-30564: -- Summary: Regression in the wide schema benchmark Key: SPARK-30564 URL: https://issues.apache.org/jira/browse/SPARK-30564 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk New results of WideSchemaBenchmark generated in the PR: https://github.com/apache/spark/pull/27078 show regressions up to 2 times. Before: {code} 2500 select expressions 103 / 107 0.0 102962705.0 0.1X {code} https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332L11 After: {code} 2500 select expressions 211 214 4 0.0 210927791.0 0.0X {code} https://github.com/apache/spark/pull/27078/files#diff-8d27bbf2f73a68bf0c2025f0702f7332R11
[jira] [Created] (SPARK-30563) Regressions in Join benchmarks
Maxim Gekk created SPARK-30563: -- Summary: Regressions in Join benchmarks Key: SPARK-30563 URL: https://issues.apache.org/jira/browse/SPARK-30563 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Regenerated benchmark results in https://github.com/apache/spark/pull/27078 show many regressions in JoinBenchmark. The benchmarked queries slowed down by up to 3x; see old results: https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 new results: https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 One difference is that the new queries use the `NoOp` datasource.
[jira] [Created] (SPARK-30562) Regression in interval string parsing
Maxim Gekk created SPARK-30562: -- Summary: Regression in interval string parsing Key: SPARK-30562 URL: https://issues.apache.org/jira/browse/SPARK-30562 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Previously: 11 units w/o interval - 1972.8 ns per row. Regenerated results in the PR https://github.com/apache/spark/pull/27078: 11 units w/o interval - 3272.6 ns per row. The regression is ~66%; see https://github.com/apache/spark/pull/27078/files#diff-586487fac2b9b1303aaf80adf8fa37abR28
[jira] [Created] (SPARK-30561) start spark applications without a 30second startup penalty
t oo created SPARK-30561: Summary: start spark applications without a 30second startup penalty Key: SPARK-30561 URL: https://issues.apache.org/jira/browse/SPARK-30561 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: t oo see https://stackoverflow.com/questions/57610138/how-to-start-spark-applications-without-a-30second-startup-penalty using spark standalone. There are several sleeps that can be removed:
grep -i 'sleep(' -R * | grep -v 'src/test/' | grep -E '^core' | grep -ivE 'mesos|yarn|python|HistoryServer|spark/ui/'
core/src/main/scala/org/apache/spark/util/Clock.scala: Thread.sleep(sleepTime)
core/src/main/scala/org/apache/spark/SparkContext.scala: * sc.parallelize(1 to 1, 2).map { i => Thread.sleep(10); i }.count()
core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala: private def delay(secs: Duration = 5.seconds) = Thread.sleep(secs.toMillis)
core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala: Thread.sleep(1000)
core/src/main/scala/org/apache/spark/deploy/Client.scala:Thread.sleep(5000)
core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala: Thread.sleep(100)
core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala: Thread.sleep(duration)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:def sleep(seconds: Int): Unit = (0 until seconds).takeWhile { _ =>
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala: Thread.sleep(1000)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala: sleeper.sleep(waitSeconds)
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala: def sleep(seconds: Int): Unit
core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala: Thread.sleep(REPORT_DRIVER_STATUS_INTERVAL)
core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala: Thread.sleep(10)
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
Thread.sleep(SLEEP_TIME_SECS * 1000L)
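The general fix for fixed-duration sleeps like these is to wait on a condition or event instead, so startup proceeds as soon as the awaited state arrives rather than after a hard-coded delay. A generic sketch in Python (not Spark code):

```python
# Generic sketch: a fixed sleep always pays the full delay, while waiting on
# an Event returns as soon as the condition is signalled.
import threading
import time

def worker(ready):
    time.sleep(0.05)   # the real readiness arrives after ~50 ms
    ready.set()

ready = threading.Event()
threading.Thread(target=worker, args=(ready,)).start()

start = time.monotonic()
ready.wait(timeout=5)            # returns at ~50 ms, not after a fixed 5 s sleep
elapsed = time.monotonic() - start
print(f"waited {elapsed:.2f}s instead of a hard-coded sleep")
assert elapsed < 1.0
```

The JVM equivalent would be a `CountDownLatch` or `Object.wait`/`notify` in place of the `Thread.sleep` polling loops listed above.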
[jira] [Updated] (SPARK-30560) allow driver to consume a fractional core
[ https://issues.apache.org/jira/browse/SPARK-30560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-30560: - Description: see https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste this is to make it possible for a driver to use 0.2 cores rather than a whole core Standard CPUs, no GPUs was: see https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste this is to make it possible for a driver to use 0.2 cores rather than a whole core > allow driver to consume a fractional core > - > > Key: SPARK-30560 > URL: https://issues.apache.org/jira/browse/SPARK-30560 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.4.4 >Reporter: t oo >Priority: Minor > > see > https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste > this is to make it possible for a driver to use 0.2 cores rather than a whole > core > Standard CPUs, no GPUs
[jira] [Created] (SPARK-30560) allow driver to consume a fractional core
t oo created SPARK-30560: Summary: allow driver to consume a fractional core Key: SPARK-30560 URL: https://issues.apache.org/jira/browse/SPARK-30560 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.4.4 Reporter: t oo see https://stackoverflow.com/questions/56781927/apache-spark-standalone-scheduler-why-does-driver-need-a-whole-core-in-cluste this is to make it possible for a driver to use 0.2 cores rather than a whole core
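The capacity argument behind the request is simple arithmetic. A toy sketch, assuming a hypothetical 4-core standalone worker:

```python
# Toy capacity math for a hypothetical 4-core worker: reserving a whole core
# for a driver that only needs 0.2 strands 0.8 cores of executor capacity.
worker_cores = 4.0

def executor_capacity(driver_cores):
    # Cores left for executors after the driver's reservation.
    return worker_cores - driver_cores

print(executor_capacity(1.0))  # 3.0 cores left with today's whole-core driver
print(executor_capacity(0.2))  # 3.8 cores left if fractional cores were allowed
```

Multiplied across many small drivers on the same cluster, the whole-core reservation adds up, which is the motivation for the ticket.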
[jira] [Updated] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
[ https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ori Popowski updated SPARK-30559: - Description: In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and INFER_AND_SAVE do not work as intended. They were supposed to infer a case-sensitive schema from the underlying files, but they do not work. # INFER_ONLY never works: it will always use lowercase column names from Hive metastore schema # INFER_AND_SAVE only works the second time {{spark.sql("SELECT …")}} is called (the first time it writes the schema to TBLPROPERTIES in the metastore and subsequent calls read that schema, so they do work) h3. Expected behavior (according to SPARK-19611) INFER_ONLY - infer the schema from the underlying files INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls h2. Reproduce h3. Prepare the data h4. 1) Create a Parquet file {code:scala} scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t"){code} h4. 2) Inspect the Parquet files {code:sh} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-0-….snappy.parquet {"theString":"a","theNumber":1} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-1-….snappy.parquet {"theString":"b","theNumber":2}{code} We see that they are saved with camelCase column names. h4. 3) Create a Hive table {code:sql} hive> CREATE EXTERNAL TABLE t(theString string, theNumber int) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 'hdfs:///t';{code} h3. Reproduce INFER_ONLY bug h4. 
3) Read the table in Spark using INFER_ONLY {code:sh} $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code} {code:scala} scala> spark.sql("SELECT * FROM default.t").columns.foreach(println) thestring thenumber {code} h4. Conclusion When INFER_ONLY is set, column names are always lowercase. h3. Reproduce INFER_AND_SAVE bug h4. 1) Run for the first time {code:sh} $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code} {code:scala} scala> spark.sql("SELECT * FROM default.t").columns.foreach(println) thestring thenumber{code} We see that column names are lowercase h4. 2) Run for the second time {code:scala} scala> spark.sql("SELECT * FROM default.t").columns.foreach(println) theString theNumber{code} We see that the column names are camelCase h4. Conclusion When INFER_AND_SAVE is set, column names are lowercase on the first call and camelCase on subsequent calls. was: In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and INFER_AND_SAVE do not work as intended. They were supposed to infer a case-sensitive schema from the underlying files, but they do not work. # INFER_ONLY never works: it will always use lowercase column names from Hive metastore schema # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called (the first time it writes the schema to TBLPROPERTIES in the metastore and subsequent calls read that schema, so they do work) h3. Expected behavior (according to SPARK-19611) INFER_ONLY - infer the schema from the underlying files INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls h2. Reproduce h3. Prepare the data h4. 1) Create a Parquet file {code:scala} scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t"){code} h4. 
2) Inspect the Parquet files {code:sh} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-0-….snappy.parquet {"theString":"a","theNumber":1} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-1-….snappy.parquet {"theString":"b","theNumber":2}{code} We see that they are saved with camelCase column names. h4. 3) Create a Hive table {code:sql} hive> CREATE EXTERNAL TABLE t(theString string, theNumber int) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 'hdfs:///t';{code} h3. Reproduce INFER_ONLY bug h4. 3) Read the table in Spark using INFER_ONLY {code:sh} $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code} {code:sh} scala> spark.sql("SELECT * FROM default.t").columns.foreach(println) thestring thenumber {code} h4. Conclusion When INFER_ONLY is set, column names are always lowercase. h3. Reproduce INFER_AND_SAVE bug h4. 1) Run the
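The intended INFER_ONLY behavior described above boils down to matching the lowercased metastore column names against the case-preserving names found in the Parquet file footers. A minimal, dependency-free illustration of that matching (plain Scala, names hypothetical; this is a sketch of the idea, not Spark's actual implementation):

```scala
// Recover case-sensitive column names by matching the lowercased
// metastore schema against the case-preserving file schema.
object SchemaCaseMatch {
  def inferCaseSensitive(metastoreCols: Seq[String], fileCols: Seq[String]): Seq[String] = {
    // Index the file schema by lowercased name.
    val byLower = fileCols.map(c => c.toLowerCase -> c).toMap
    // Fall back to the metastore name when the files have no match.
    metastoreCols.map(c => byLower.getOrElse(c.toLowerCase, c))
  }

  def main(args: Array[String]): Unit = {
    val inferred = inferCaseSensitive(
      Seq("thestring", "thenumber"),   // lowercase names from the metastore
      Seq("theString", "theNumber"))   // camelCase names from the Parquet footer
    println(inferred.mkString(" "))    // theString theNumber
  }
}
```

With this matching applied, the `SELECT * FROM default.t` repro would print `theString theNumber` on the first call, which is what SPARK-19611 intended for both inference modes.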
[jira] [Created] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
Ori Popowski created SPARK-30559: Summary: Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive Key: SPARK-30559 URL: https://issues.apache.org/jira/browse/SPARK-30559 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6 Reporter: Ori Popowski Spark SQL's spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY and INFER_AND_SAVE do not work as intended. They were supposed to infer a case-sensitive schema from the underlying files, but they do not work. # INFER_ONLY never works: it will always use lowercase column names from the Hive metastore schema # INFER_AND_SAVE only works the second time spark.sql("SELECT …") is called (the first time it writes the schema to TBLPROPERTIES in the metastore and subsequent calls read that schema, so they do work) h3. Expected behavior (according to SPARK-19611) INFER_ONLY - infer the schema from the underlying files INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls h2. Reproduce h3. Prepare the data h4. 1) Create a Parquet file {code:java} scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t"){code} h4. 2) Inspect the Parquet files {code:java} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-0-….snappy.parquet {"theString":"a","theNumber":1} $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-1-….snappy.parquet {"theString":"b","theNumber":2}{code} We see that they are saved with camelCase column names. h4. 
3) Create a Hive table {code:java} hive> CREATE EXTERNAL TABLE t(theString string, theNumber int) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 'hdfs:///t';{code} h3. Reproduce INFER_ONLY bug h4. 3) Read the table in Spark using INFER_ONLY {code:java} $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code} {code:java} scala> spark.sql("SELECT * FROM default.t").columns.foreach(println) thestring thenumber {code} h4. Conclusion When INFER_ONLY is set, column names are always lowercase. h3. Reproduce INFER_AND_SAVE bug h4. 1) Run for the first time {code:java} $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code} {code:java} scala> spark.sql("select * from default.t").columns.foreach(println) thestring thenumber{code} We see that the column names are lowercase h4. 2) Run for the second time {code:java} scala> spark.sql("select * from default.t").columns.foreach(println) theString theNumber{code} We see that the column names are camelCase h4. Conclusion When INFER_AND_SAVE is set, column names are lowercase on the first call and camelCase on subsequent calls. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30558) Avoid rebuilding `AvroOptions` per each partition
Maxim Gekk created SPARK-30558: -- Summary: Avoid rebuilding `AvroOptions` per each partition Key: SPARK-30558 URL: https://issues.apache.org/jira/browse/SPARK-30558 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, an instance of `AvroOptions` is created for each partition. This can be avoided by building it only once and passing it to `AvroScan`. See https://github.com/apache/spark/pull/27174#discussion_r365596481 for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
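The SPARK-30558 improvement above is an instance of the standard build-once pattern: parse the options map into a typed object a single time on the driver and hand that object to each partition, instead of re-parsing the map inside every partition's reader. A dependency-free sketch of the pattern (the class, fields, and defaults here are illustrative, not Spark's actual `AvroOptions` API):

```scala
// Build-once pattern: parse an options map into a typed object once,
// then reuse that object for every partition.
final case class AvroLikeOptions(ignoreExtension: Boolean, compression: String)

object AvroLikeOptions {
  var parseCount = 0  // only to demonstrate how often parsing happens

  def parse(raw: Map[String, String]): AvroLikeOptions = {
    parseCount += 1
    AvroLikeOptions(
      ignoreExtension = raw.getOrElse("ignoreExtension", "true").toBoolean,
      compression = raw.getOrElse("compression", "snappy"))
  }
}

object BuildOnceDemo {
  def main(args: Array[String]): Unit = {
    val raw = Map("compression" -> "deflate")
    // Build once, outside the per-partition loop...
    val opts = AvroLikeOptions.parse(raw)
    // ...and pass the same instance to every partition's reader.
    (1 to 100).foreach(_ => assert(opts.compression == "deflate"))
    println(AvroLikeOptions.parseCount)  // 1, not 100
  }
}
```

The payoff is small per call but adds up on scans with many partitions, and it also avoids repeating any validation the option parsing performs.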
[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service
[ https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018544#comment-17018544 ] t oo commented on SPARK-27750: -- Did some digging, let me know if I'm on the right track. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L785 ---> why the precedence comment? I want the opposite in private def schedule(): Unit = { add: rawFreeCores = shuffledAliveWorkers.map(_.coresFree).sum cores_reserved_for_apps = *get from config somehow* i.e. 8 forDriversFreeCores = math.max(rawFreeCores-cores_reserved_for_apps,0) then wrap the inner steps in: if forDriversFreeCores >= driver.desc.cores { canLaunchDriver } https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L796 #just a note about driver/exec requests https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L746-L775 > Standalone scheduler - ability to prioritize applications over drivers, many > drivers act like Denial of Service > --- > > Key: SPARK-27750 > URL: https://issues.apache.org/jira/browse/SPARK-27750 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: t oo >Priority: Minor > > If I submit 1000 spark-submit drivers then they consume all the cores on my > cluster (essentially it acts like a Denial of Service) and no Spark > 'application' gets to run since the cores are all consumed by the 'drivers'. > This feature is about having the ability to prioritize applications over > drivers so that at least some 'applications' can start running. I guess it > would be like: If (driver.state = 'submitted' and (exists some app.state = > 'submitted')) then set app.state = 'running' > if all apps have app.state = 'running' then set driver.state = 'submitted' > > Secondary to this, why must a driver consume a minimum of 1 entire core? 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
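The idea sketched in the SPARK-27750 comment above can be stated concretely: reserve a configurable number of cores for applications, and only launch a waiting driver if the cores left over cover its request. A standalone sketch of that gating logic (the config value and names follow the commenter's hypotheticals; this is not an existing Spark setting or API):

```scala
// Gate driver launches on the cores left over after reserving some
// for applications, per the formula in the comment above:
//   forDriversFreeCores = max(rawFreeCores - coresReservedForApps, 0)
object DriverGate {
  def canLaunchDriver(workerFreeCores: Seq[Int],
                      coresReservedForApps: Int,
                      driverCores: Int): Boolean = {
    val rawFreeCores = workerFreeCores.sum
    val forDriversFreeCores = math.max(rawFreeCores - coresReservedForApps, 0)
    forDriversFreeCores >= driverCores
  }

  def main(args: Array[String]): Unit = {
    // 3 workers with 4 free cores each and 8 cores reserved for apps
    // leaves only 4 cores available to drivers.
    println(DriverGate.canLaunchDriver(Seq(4, 4, 4), 8, 4))  // true
    println(DriverGate.canLaunchDriver(Seq(4, 4, 4), 8, 5))  // false
  }
}
```

Under this rule, a flood of submitted drivers can no longer consume every core: once free cores drop to the reserved threshold, remaining drivers stay queued and applications can still be scheduled.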
[jira] [Assigned] (SPARK-30539) DataFrame.tail in PySpark API
[ https://issues.apache.org/jira/browse/SPARK-30539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30539: - Assignee: Hyukjin Kwon > DataFrame.tail in PySpark API > - > > Key: SPARK-30539 > URL: https://issues.apache.org/jira/browse/SPARK-30539 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-30185 added DataFrame.tail API. It should be good for PySpark side to > have it too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30539) DataFrame.tail in PySpark API
[ https://issues.apache.org/jira/browse/SPARK-30539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30539. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27251 [https://github.com/apache/spark/pull/27251] > DataFrame.tail in PySpark API > - > > Key: SPARK-30539 > URL: https://issues.apache.org/jira/browse/SPARK-30539 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > SPARK-30185 added DataFrame.tail API. It should be good for PySpark side to > have it too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
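The `DataFrame.tail(n)` API referenced in SPARK-30539/SPARK-30185 returns the last n rows of a DataFrame, the mirror image of `head(n)`/`take(n)`. Its semantics on a plain local collection are simply `takeRight`, as this Spark-free sketch shows (illustration only, not Spark's implementation, which must scan partitions from the end):

```scala
// Local-collection analogue of DataFrame.tail(n): the last n rows,
// in order (versus head(n)/take(n), which return the first n).
object TailDemo {
  def tail[A](rows: Seq[A], n: Int): Seq[A] = rows.takeRight(n)

  def main(args: Array[String]): Unit = {
    println(tail(Seq(1, 2, 3, 4, 5), 2))  // List(4, 5)
  }
}
```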
[jira] [Resolved] (SPARK-30544) Upgrade Genjavadoc to 0.15
[ https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30544. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27255 [https://github.com/apache/spark/pull/27255] > Upgrade Genjavadoc to 0.15 > -- > > Key: SPARK-30544 > URL: https://issues.apache.org/jira/browse/SPARK-30544 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.0 > > > Genjavadoc 0.14 doesn't support Scala 2.13, so sbt -Pscala-2.13 will fail to build. > Let's upgrade it to 0.15. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org