[jira] [Commented] (SPARK-37740) Migrate to M1 machines in Jenkins

2021-12-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465560#comment-17465560
 ] 

Dongjoon Hyun commented on SPARK-37740:
---

Thank you, [~hyukjin.kwon]!

> Migrate to M1 machines in Jenkins
> -
>
> Key: SPARK-37740
> URL: https://issues.apache.org/jira/browse/SPARK-37740
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See 
> https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.
> We should revisit all related Jenkins specific codes when M1 machines are 
> ready.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37749) Built-in ORC reader cannot read data file in sub-directories created by Hive Tez

2021-12-26 Thread Ye Li (Jira)
Ye Li created SPARK-37749:
-

 Summary: Built-in ORC reader cannot read data file in 
sub-directories created by Hive Tez
 Key: SPARK-37749
 URL: https://issues.apache.org/jira/browse/SPARK-37749
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Affects Versions: 3.2.0, 3.1.2, 3.0.3
 Environment: HDP 3.1.4
Reporter: Ye Li


A partitioned Hive table is created and loaded with data in HDP 3.1.4. The Hive engine 
is Tez, and the storage format is ORC. The data directory layout is like:

table1/statt_dt=2021-12-08/-ext-1/00_0

 

The result of the Spark SQL query "select * from table1" does not include the 
data of partition 2021-12-08.
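
Below is a minimal PySpark sketch of the reported behavior plus a possible workaround; the session setup, the partition filter, and the recursive-listing configs are assumptions for illustration, not a confirmed fix for this ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Reported behavior: rows written by Hive on Tez under the extra
# "-ext-1" sub-directory are missing from the result.
spark.sql("select * from table1 where statt_dt = '2021-12-08'").show()

# Possible workaround (assumption, not verified for this ticket): fall back to
# the Hive SerDe reader and ask it to list partition directories recursively.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("SET hive.mapred.supportsSubDirectories=true")
spark.sql("select * from table1 where statt_dt = '2021-12-08'").show()
{code}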

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465550#comment-17465550
 ] 

Apache Spark commented on SPARK-37747:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35030

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34544) pyspark toPandas() should return pd.DataFrame

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465507#comment-17465507
 ] 

Apache Spark commented on SPARK-34544:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35029

> pyspark toPandas() should return pd.DataFrame
> -
>
> Key: SPARK-34544
> URL: https://issues.apache.org/jira/browse/SPARK-34544
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Rafal Wojdyla
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Right now {{toPandas()}} returns {{DataFrameLike}}, which is an incomplete 
> "view" of pandas {{DataFrame}}. This leads to cases where mypy reports that 
> certain pandas methods are not present in {{DataFrameLike}}, even though those 
> methods are valid methods on pandas {{DataFrame}}, which is the actual type 
> of the object. This requires type-ignore comments or asserts.
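
For illustration, a small hedged sketch of the resulting friction; it assumes a running SparkSession named `spark`, and the `to_csv` call and the mypy error code are only examples, not taken from the ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = spark.range(3).toPandas()

# At runtime `pdf` is a real pandas DataFrame, but its declared return type
# (DataFrameLike) does not expose every pandas method, so a type checker may
# flag perfectly valid calls unless they are suppressed:
pdf.to_csv("/tmp/out.csv")  # type: ignore[attr-defined]
{code}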



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34544) pyspark toPandas() should return pd.DataFrame

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465506#comment-17465506
 ] 

Apache Spark commented on SPARK-34544:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35029

> pyspark toPandas() should return pd.DataFrame
> -
>
> Key: SPARK-34544
> URL: https://issues.apache.org/jira/browse/SPARK-34544
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Rafal Wojdyla
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Right now {{toPandas()}} returns {{DataFrameLike}}, which is an incomplete 
> "view" of pandas {{DataFrame}}. This leads to cases where mypy reports that 
> certain pandas methods are not present in {{DataFrameLike}}, even though those 
> methods are valid methods on pandas {{DataFrame}}, which is the actual type 
> of the object. This requires type-ignore comments or asserts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37741) Remove Jenkins badge in README.md

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37741.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35025
[https://github.com/apache/spark/pull/35025]

> Remove Jenkins badge in README.md
> -
>
> Key: SPARK-37741
> URL: https://issues.apache.org/jira/browse/SPARK-37741
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.3.0
>
>
> We should remove the Jenkins badge, which is now obsolete, in 
> https://github.com/apache/spark/blob/master/README.md.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37740) Migrate to M1 machines in Jenkins

2021-12-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465499#comment-17465499
 ] 

Hyukjin Kwon commented on SPARK-37740:
--

cc [~dongjoon] [~viirya] [~dbtsai] FYI

> Migrate to M1 machines in Jenkins
> -
>
> Key: SPARK-37740
> URL: https://issues.apache.org/jira/browse/SPARK-37740
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See 
> https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.
> We should revisit all related Jenkins specific codes when M1 machines are 
> ready.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37743) Revisit codes in Scala tests

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37743:
-
Summary: Revisit codes in Scala tests  (was: Remove/update obsolete codes 
in Scala tests)

> Revisit codes in Scala tests
> 
>
> Key: SPARK-37743
> URL: https://issues.apache.org/jira/browse/SPARK-37743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:  // 
> TODO(SPARK-9603): Building a package is flaky on Jenkins Maven builds.
> core/src/test/scala/org/apache/spark/deploy/SparkSubmitTestUtils.scala:  
> // This test suite has some weird behaviors when executed on Jenkins:
> core/src/test/scala/org/apache/spark/deploy/SparkSubmitTestUtils.scala:  
> // 1. Sometimes it gets extremely slow out of unknown reason on Jenkins.  
> Here we add a
> core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala:  // 
> is only 2, while on Jenkins it's 32. For this specific test, 2 available 
> processors, which
> external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala:
>  * when running on a slow Jenkins machine) before records start to be 
> removed. To make sure a test
> project/SparkBuild.scala:  // with Jenkins flakiness.
> repl/src/test/scala/org/apache/spark/repl/SparkShellSuite.scala:  // This 
> test suite sometimes gets extremely slow out of unknown reason on Jenkins.  
> Here we
> sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
> // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
> sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
> // Seems fs.getContentSummary returns wrong table size on Jenkins. So we 
> use
> sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala:// 
> the last dot due to a bug in SBT. This makes easier to debug via Jenkins test 
> result
> sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala:
>   // Should emit new progresses every 10 ms, but we could be facing a 
> slow Jenkins
> sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryStatusAndProgressSuite.scala:
> // Should emit new progresses every 10 ms, but we could be facing a 
> slow Jenkins
> sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala:
>   // This test suite sometimes gets extremely slow out of unknown reason 
> on Jenkins.  Here we
> sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala:
>   // started at a time, which is not Jenkins friendly.
> sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientBuilder.scala:
>   // In order to speed up test execution during development or in Jenkins, 
> you can specify the path
> sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala:
>   // more cores, the issue can be reproduced steadily.  Fortunately our 
> Jenkins builder meets this
> streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala:
>   // If Jenkins is slow, we may not have a chance to run many threads 
> simultaneously. Having
> core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala:  
> // This is to avoid any potential flakiness in the test because of large 
> pauses in jenkins
> sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java: 
>  // (e.g., /home/jenkins/workspace/SparkPullRequestBuilder@2)
> sql/core/src/test/resources/sql-tests/inputs/datetime-parsing-invalid.sql:-- 
> in java 8 this case is invalid, but valid in java 11, disabled for jenkins
> sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala:
> // TODO: Figure out why StringType doesn't work on jenkins.
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala:
> // Weird DDL differences result in failures on jenkins.
> project/SparkBuild.scala:  // with Jenkins flakiness.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37741) Remove Jenkins badge in README.md

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37741:


Assignee: Hyukjin Kwon

> Remove Jenkins badge in README.md
> -
>
> Key: SPARK-37741
> URL: https://issues.apache.org/jira/browse/SPARK-37741
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> We should remove the Jenkins badge, which is now obsolete, in 
> https://github.com/apache/spark/blob/master/README.md.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37742) Revisit codes in testing scripts for Jenkins

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37742:
-
Summary: Revisit codes in testing scripts for Jenkins  (was: Remove 
obsolete code in testing scripts for Jenkins)

> Revisit codes in testing scripts for Jenkins
> 
>
> Key: SPARK-37742
> URL: https://issues.apache.org/jira/browse/SPARK-37742
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> e.g.)
> https://github.com/apache/spark/blob/master/dev/run-tests-jenkins
> https://github.com/apache/spark/blob/master/dev/run-tests-jenkins.py
> Jenkins specific logics at 
> https://github.com/apache/spark/blob/master/dev/run-tests.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37744) Revisit codes in PySpark, SparkR and other docs

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37744:
-
Summary: Revisit codes in PySpark, SparkR and other docs  (was: 
Remove/update obsolete codes in PySpark, SparkR and other docs)

> Revisit codes in PySpark, SparkR and other docs
> ---
>
> Key: SPARK-37744
> URL: https://issues.apache.org/jira/browse/SPARK-37744
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> R/check-cran.sh:# Jenkins installs arrow. See SPARK-29339.
> R/pkg/tests/run-all.R:  # CRAN machines. For Jenkins we should already have 
> SPARK_HOME set.
> R/run-tests.sh:# We have 2 NOTEs: for RoxygenNote and one in Jenkins only 
> "No repository set"
> dev/run-pip-tests:# Jenkins has PySpark installed under user sitepackages 
> shared for some reasons.
> python/pyspark/sql/tests/test_streaming.py:# Jenkins is very 
> slow, we don't assert it. If there is something wrong, "lastProgress"
> python/pyspark/streaming/tests/test_kinesis.py:# Don't start the 
> StreamingContext because we cannot test it in Jenkins
> dev/create-release/release-build.sh:# This is a band-aid fix to avoid the 
> failure of Maven nightly snapshot in some Jenkins
> docs/building-spark.md:## Running Jenkins tests with GitHub Enterprise
> docs/building-spark.md:To run tests with Jenkins:
> dev/tests/pr_merge_ability.sh:# found at dev/run-tests-jenkins.
> dev/tests/pr_merge_ability.sh:# known as `ghprbActualCommit` in 
> `run-tests-jenkins`
> dev/tests/pr_merge_ability.sh:# known as `sha1` in `run-tests-jenkins`
> dev/tests/pr_public_classes.sh:# found at dev/run-tests-jenkins.
> dev/tests/pr_public_classes.sh:# known as `ghprbActualCommit` in 
> `run-tests-jenkins`
> docs/building-spark.md:./dev/run-tests-jenkins
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37740) Migrate to M1 machines in Jenkins

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37740:
-
Description: 
See 
https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.

Jenkins is retired, and we should revisit all related Jenkins specific codes 
when M1 machines are ready.

  was:
See 
https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.

Jenkins is retired, and we should remove all related Jenkins specific codes 
that are not used anymore in Apache Spark.


> Migrate to M1 machines in Jenkins
> -
>
> Key: SPARK-37740
> URL: https://issues.apache.org/jira/browse/SPARK-37740
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See 
> https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.
> Jenkins is retired, and we should revisit all related Jenkins specific codes 
> when M1 machines are ready.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37748) Add Jenkins badge back when M1 machines are ready

2021-12-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37748:


 Summary: Add Jenkins badge back when M1 machines are ready
 Key: SPARK-37748
 URL: https://issues.apache.org/jira/browse/SPARK-37748
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Project Infra
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Currently, the badge is removed because Jenkins is retired; see 
https://issues.apache.org/jira/browse/SPARK-37741. We should consider bringing the 
badge back when M1 machines are ready.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37740) Migrate to M1 machines in Jenkins

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37740:
-
Summary: Migrate to M1 machines in Jenkins  (was: Removal of obsolete codes 
per Jenkins' retirement)

> Migrate to M1 machines in Jenkins
> -
>
> Key: SPARK-37740
> URL: https://issues.apache.org/jira/browse/SPARK-37740
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See 
> https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.
> Jenkins is retired, and we should remove all related Jenkins specific codes 
> that are not used anymore in Apache Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37740) Migrate to M1 machines in Jenkins

2021-12-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37740:
-
Description: 
See 
https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.

We should revisit all related Jenkins specific codes when M1 machines are ready.

  was:
See 
https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.

Jenkins is retired, and we should revisit all related Jenkins specific codes 
when M1 machines are ready.


> Migrate to M1 machines in Jenkins
> -
>
> Key: SPARK-37740
> URL: https://issues.apache.org/jira/browse/SPARK-37740
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See 
> https://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3CCACdU-dTLuB--1GzAv6XfS-pCrcihhvDpUMrGe%3DfJXUYJpqiX9Q%40mail.gmail.com%3E.
> We should revisit all related Jenkins specific codes when M1 machines are 
> ready.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37578:


Assignee: Apache Spark

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: Apache Spark
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
> val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
> assert(bytesWritten.sum > 0)
> assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465498#comment-17465498
 ] 

Apache Spark commented on SPARK-37578:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35028

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
> val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
> assert(bytesWritten.sum > 0)
> assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37578:


Assignee: (was: Apache Spark)

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
> val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
> assert(bytesWritten.sum > 0)
> assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37747.
---
Fix Version/s: 3.3.0
 Assignee: William Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/35027

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465452#comment-17465452
 ] 

Apache Spark commented on SPARK-37747:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35027

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37747:


Assignee: Apache Spark

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465451#comment-17465451
 ] 

Apache Spark commented on SPARK-37747:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35027

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37747:


Assignee: (was: Apache Spark)

> Upgrade zstd-jni to 1.5.1-1
> ---
>
> Key: SPARK-37747
> URL: https://issues.apache.org/jira/browse/SPARK-37747
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37747) Upgrade zstd-jni to 1.5.1-1

2021-12-26 Thread William Hyun (Jira)
William Hyun created SPARK-37747:


 Summary: Upgrade zstd-jni to 1.5.1-1
 Key: SPARK-37747
 URL: https://issues.apache.org/jira/browse/SPARK-37747
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443
 ] 

Daniel Davies edited comment on SPARK-37697 at 12/26/21, 8:29 PM:
--

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
 - In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
 - In the schema inference function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
 
This would work for inputs where rows are numpy arrays of any type; but a bit 
more work would be needed to make a row like the following work:
 
{code:java}
[1, 2, numpy.int64(3)]{code}

 

Hope that all makes sense- let me know which of the two problems you are more 
interested in solving.
 
I'd also be keen to get a review from someone of whether any of my solutions 
made sense


was (Author: JIRAUSER282609):
Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 

[jira] [Comment Edited] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443
 ] 

Daniel Davies edited comment on SPARK-37697 at 12/26/21, 8:24 PM:
--

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
 - In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
 - In the schema inference function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
 
This would work for inputs where rows are numpy arrays of any type; but a bit 
more work would be needed to make a row like the following work:
 
{code:java}
[1, 2, numpy.int64(3)]{code}

 

Hope that all makes sense- let me know which of the two problems you are more 
interested in solving.
 
I'd also be keen to get a review of whether any of my solutions made sense


was (Author: JIRAUSER282609):
Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 

[jira] [Comment Edited] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443
 ] 

Daniel Davies edited comment on SPARK-37697 at 12/26/21, 8:23 PM:
--

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
 - In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
 - In the schema inference function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
 
This would work for inputs where rows are numpy arrays of any type; but a bit 
more work would be needed to make a row like the following work:
 
{code:java}
[1, 2, numpy.int64(3)]{code}

 

Hope that all makes sense- let me know which of the two problems you meant & 
are more interested in solving.
 
I'd also be keen to get a review of whether any of my solutions made sense


was (Author: JIRAUSER282609):
Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 

[jira] [Commented] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443
 ] 

Daniel Davies commented on SPARK-37697:
---

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
- In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
- In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
- In the schema inference function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
 
This would work for inputs where rows are numpy arrays of any type; but a bit 
more work would be needed to make a row like the following work:
 
{code:java}
[1, 2, numpy.int64(3)]{code}
 
Hope that all makes sense- let me know which of the two problems you meant & 
are more interested in solving.
 
I'd also be keen to get a review of whether any of my solutions made sense

> Make it easier to convert numpy arrays to Spark Dataframes
> --
>
> Key: SPARK-37697
> URL: https://issues.apache.org/jira/browse/SPARK-37697
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Douglas Moore
>Priority: Major
>
> Make it easier to convert numpy arrays to dataframes.
> Often we receive errors:
>  
> {code:java}
> df = spark.createDataFrame(numpy.arange(10))
> Can not infer schema for type: 
> {code}
>  
> OR
> {code:java}
> df = spark.createDataFrame(numpy.arange(10.))
> Can not infer schema for type: 
> {code}
>  
> Today (Spark 3.x) we have to:
> {code:java}
> spark.createDataFrame(pd.DataFrame(numpy.arange(10.))) {code}
> Make this easier with a direct conversion from Numpy arrays to Spark 
> Dataframes.
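
A hedged sketch of the direction discussed in the comment above (the tolist() idea); the SparkSession setup and column names are assumptions, and this is an illustration rather than the ticket's agreed design.

{code:python}
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Today's workaround from the description: go through pandas first.
df1 = spark.createDataFrame(pd.DataFrame(np.arange(10.)))

# The idea sketched in the comment: turn ndarray rows into plain Python
# lists/scalars so the existing row-based inference can handle them.
rows = [np.arange(3), np.arange(3)]
df2 = spark.createDataFrame([r.tolist() for r in rows], ["a", "b", "c"])

df1.show()
df2.show()
{code}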



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465443#comment-17465443
 ] 

Daniel Davies edited comment on SPARK-37697 at 12/26/21, 8:22 PM:
--

Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
 - In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
 - In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
 - In the schema inference function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1280]
 
This would work for inputs where rows are numpy arrays of any type; but a bit 
more work would be needed to make a row like the following work:
 
{code:java}
[1, 2, numpy.int64(3)]{code}


was (Author: JIRAUSER282609):
Hey Douglas,
 
I've definitely been caught by numpy types a few times in some of our spark 
workflows, so would be keen to solve in PySpark directly also.
 
Do you want to be able to create a DataFrame from a list of numpy numbers 
directly? This is supported in Pandas, but I think it's not possible to do this 
even with native Python types in Spark (e.g. my understanding is that the input 
to createDataFrame is required to be an iterable of rows), so maybe there's a 
discussion around supporting that. For example, running the below:
{code:java}
df = spark.createDataFrame([1,2,3,4,5]){code}
Fails with:
{code:java}
TypeError: Can not infer schema for type:  {code}
The more common issue I've come across though is something like the following 
not being supported:
{code:java}
df = spark.createDataFrame([np.arange(10), np.arange(10)]){code}
(I.e. where each row could be a numpy list and/or the overall input list of 
rows is wrapped in a numpy array also).
 
I'd be happy to take the work on for this PR- whether the first possible case, 
or the second (it would be my first contribution, so if you/ anyone else think 
this is more complex than I'm currently estimating below, let me know).
 
For creating a dataframe with a flat iterable, this looks like it would be an 
addition of a createDataFrame function around 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L700]
 
For the second problem- i.e. where the input model remains the same, but rows 
are provided as numpy arrays; I'd be keen to re-use numpy's ndarray tolist() 
function here. Not only does this push the underlying array object into Python 
lists, which PySpark already supports, but it also has the benefit of 
converting list items of the numpy-specific types to Python native scalars. 
From a brief glance I've taken, it looks like this would need to be taken into 
account in three places:
 
- In the set of prepare functions in session.py 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L912]
- In the conversion function 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1447]
- In the schema inference function 

[jira] [Commented] (SPARK-37738) PySpark date_add only accepts an integer as it's second parameter

2021-12-26 Thread Daniel Davies (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465417#comment-17465417
 ] 

Daniel Davies commented on SPARK-37738:
---

Thanks for the quick reply [~hyukjin.kwon] - I'll take a look over the next few 
days!

> PySpark date_add only accepts an integer as it's second parameter
> -
>
> Key: SPARK-37738
> URL: https://issues.apache.org/jira/browse/SPARK-37738
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Daniel Davies
>Priority: Minor
>
> Hello,
> I have a quick question regarding the PySpark date_add function (and its 
> related functions, I guess). Using date_add as an example, the PySpark API 
> takes a [column, and an int as its second parameter|#L2203].
> This feels a bit weird, since the underlying SQL expression can take a column 
> as the second parameter also; in fact, to my limited understanding, the Scala 
> [API 
> itself|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3114]
>  just calls lit on this second parameter anyway. Is there a reason date_add 
> doesn't support a column type as the second parameter in PySpark?
> This isn't a major issue, as the alternative is of course to just use 
> date_add in an expr statement; I just wondered what usability is being 
> traded off for. I'm happy to contribute a PR if this is something that would 
> be worthwhile pursuing.
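
As a hedged sketch of the expr workaround mentioned above; the sample data and column names are made up for illustration.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-12-26", 3)], ["start_date", "num_days"])

# Works today: the second argument must be a plain Python int.
df.select(F.date_add("start_date", 1)).show()

# A column-valued number of days currently has to go through a SQL expression:
df.select(F.expr("date_add(start_date, num_days)")).show()
{code}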



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465405#comment-17465405
 ] 

Apache Spark commented on SPARK-37746:
--

User 'chia7712' has created a pull request for this issue:
https://github.com/apache/spark/pull/35026

> log4j2-defaults.properties is not working since log4j 2 is always initialized 
> by default
> 
>
> Key: SPARK-37746
> URL: https://issues.apache.org/jira/browse/SPARK-37746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> The code used to check initialization is shown below.
>       val log4j2Initialized = !LogManager.getRootLogger
>         
> .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty
> That works for log4j 1. However, log4j 2 provides a default configuration, so 
> there is always an appender (ConsoleAppender) at ERROR level.
>  
> reference from 
> [https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]
> Log4j will provide a default configuration if it cannot locate a 
> configuration file. The default configuration, provided in the 
> DefaultConfiguration class, will set up:
>  * A 
> [ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
>  attached to the root logger.
>  * A 
> [PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
>  set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
> attached to the ConsoleAppender
> Note that by default Log4j assigns the root logger to Level.ERROR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37746:


Assignee: Apache Spark

> log4j2-defaults.properties is not working since log4j 2 is always initialized 
> by default
> 
>
> Key: SPARK-37746
> URL: https://issues.apache.org/jira/browse/SPARK-37746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Apache Spark
>Priority: Minor
>
> The code used to check initialization is shown below.
>       val log4j2Initialized = !LogManager.getRootLogger
>         
> .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty
> That works for log4j 1. However, log4j 2 provides a default configuration, so 
> there is always an appender (ConsoleAppender) at ERROR level.
>  
> reference from 
> [https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]
> Log4j will provide a default configuration if it cannot locate a 
> configuration file. The default configuration, provided in the 
> DefaultConfiguration class, will set up:
>  * A 
> [ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
>  attached to the root logger.
>  * A 
> [PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
>  set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
> attached to the ConsoleAppender
> Note that by default Log4j assigns the root logger to Level.ERROR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37746:


Assignee: (was: Apache Spark)

> log4j2-defaults.properties is not working since log4j 2 is always initialized 
> by default
> 
>
> Key: SPARK-37746
> URL: https://issues.apache.org/jira/browse/SPARK-37746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> The code used to check initialization is shown below.
>       val log4j2Initialized = !LogManager.getRootLogger
>         
> .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty
> That works for log4j 1. However, log4j 2 provides a default configuration, so 
> there is always an appender (ConsoleAppender) at ERROR level.
>  
> reference from 
> [https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]
> Log4j will provide a default configuration if it cannot locate a 
> configuration file. The default configuration, provided in the 
> DefaultConfiguration class, will set up:
>  * A 
> [ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
>  attached to the root logger.
>  * A 
> [PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
>  set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
> attached to the ConsoleAppender
> Note that by default Log4j assigns the root logger to Level.ERROR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai updated SPARK-37746:
---
Description: 
The code used to check initialization is shown below.

      val log4j2Initialized = !LogManager.getRootLogger
        .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty

That works for log4j 1. However, log4j 2 provides a default configuration, so there 
is always an appender (ConsoleAppender) at ERROR level.

 

according to  
[https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]

Log4j will provide a default configuration if it cannot locate a configuration 
file. The default configuration, provided in the DefaultConfiguration class, 
will set up:
 * A 
[ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
 attached to the root logger.
 * A 
[PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
 set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
attached to the ConsoleAppender

Note that by default Log4j assigns the root logger to Level.ERROR.

 

 

 

  was:
the code used to check initialization is shown below.

      val log4j2Initialized = !LogManager.getRootLogger
        .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty

That works for log4j. However, log4j2 provides a default configuration so there 
is always a appender (ConsoleAppender) with error level.


> log4j2-defaults.properties is not working since log4j 2 is always initialized 
> by default
> 
>
> Key: SPARK-37746
> URL: https://issues.apache.org/jira/browse/SPARK-37746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> The code used to check initialization is shown below.
>       val log4j2Initialized = !LogManager.getRootLogger
>         
> .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty
> That works for log4j 1. However, log4j 2 provides a default configuration, so 
> there is always an appender (ConsoleAppender) at ERROR level.
>  
> according to  
> [https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]
> Log4j will provide a default configuration if it cannot locate a 
> configuration file. The default configuration, provided in the 
> DefaultConfiguration class, will set up:
>  * A 
> [ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
>  attached to the root logger.
>  * A 
> [PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
>  set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
> attached to the ConsoleAppender
> Note that by default Log4j assigns the root logger to Level.ERROR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai updated SPARK-37746:
---
Description: 
The code used to check initialization is shown below.

      val log4j2Initialized = !LogManager.getRootLogger
        .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty

That works for log4j 1. However, log4j 2 provides a default configuration, so there 
is always an appender (ConsoleAppender) at ERROR level.

 

reference from 
[https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]

Log4j will provide a default configuration if it cannot locate a configuration 
file. The default configuration, provided in the DefaultConfiguration class, 
will set up:
 * A 
[ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
 attached to the root logger.
 * A 
[PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
 set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
attached to the ConsoleAppender

Note that by default Log4j assigns the root logger to Level.ERROR.

 

 

 

  was:
the code used to check initialization is shown below.

      val log4j2Initialized = !LogManager.getRootLogger
        .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty

That works for log4j. However, log4j2 provides a default configuration so there 
is always a appender (ConsoleAppender) with error level.

 

according to  
[https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]

Log4j will provide a default configuration if it cannot locate a configuration 
file. The default configuration, provided in the DefaultConfiguration class, 
will set up:
 * A 
[ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
 attached to the root logger.
 * A 
[PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
 set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
attached to the ConsoleAppender

Note that by default Log4j assigns the root logger to Level.ERROR.

 

 

 


> log4j2-defaults.properties is not working since log4j 2 is always initialized 
> by default
> 
>
> Key: SPARK-37746
> URL: https://issues.apache.org/jira/browse/SPARK-37746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> The code used to check initialization is shown below.
>       val log4j2Initialized = !LogManager.getRootLogger
>         
> .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty
> That works for log4j 1. However, log4j 2 provides a default configuration, so 
> there is always an appender (ConsoleAppender) at ERROR level.
>  
> reference from 
> [https://logging.apache.org/log4j/2.x/manual/configuration.html#AutomaticConfiguration]
> Log4j will provide a default configuration if it cannot locate a 
> configuration file. The default configuration, provided in the 
> DefaultConfiguration class, will set up:
>  * A 
> [ConsoleAppender|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html]
>  attached to the root logger.
>  * A 
> [PatternLayout|https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html]
>  set to the pattern "%d\{HH:mm:ss.SSS} [%t] %-5level %logger\{36} - %msg%n" 
> attached to the ConsoleAppender
> Note that by default Log4j assigns the root logger to Level.ERROR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37746) log4j2-defaults.properties is not working since log4j 2 is always initialized by default

2021-12-26 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created SPARK-37746:
--

 Summary: log4j2-defaults.properties is not working since log4j 2 
is always initialized by default
 Key: SPARK-37746
 URL: https://issues.apache.org/jira/browse/SPARK-37746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Chia-Ping Tsai


The code used to check initialization is shown below.

      val log4j2Initialized = !LogManager.getRootLogger
        .asInstanceOf[org.apache.logging.log4j.core.Logger].getAppenders.isEmpty

That works for log4j 1. However, log4j 2 provides a default configuration, so there 
is always an appender (ConsoleAppender) at ERROR level.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37742) Remove obsolete code in testing scripts for Jenkins

2021-12-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37742:
-
Summary: Remove obsolete code in testing scripts for Jenkins  (was: Remove 
obsolate codes in testing scripts for Jenkins)

> Remove obsolete code in testing scripts for Jenkins
> ---
>
> Key: SPARK-37742
> URL: https://issues.apache.org/jira/browse/SPARK-37742
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> e.g.)
> https://github.com/apache/spark/blob/master/dev/run-tests-jenkins
> https://github.com/apache/spark/blob/master/dev/run-tests-jenkins.py
> Jenkins-specific logic at 
> https://github.com/apache/spark/blob/master/dev/run-tests.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37745) Add new option to CSVOption ignoreEmptyLines

2021-12-26 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-37745:
---

 Summary: Add new option to CSVOption ignoreEmptyLines
 Key: SPARK-37745
 URL: https://issues.apache.org/jira/browse/SPARK-37745
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Izek Greenfield


In many cases, users need to read the full CSV file including its empty lines, 
so it would be good to have such an option. The default will be true, so the 
default behavior will not change.
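
A hypothetical usage sketch of the proposed option (ignoreEmptyLines does not 
exist in Spark today- the option name and default come from this proposal, and 
the file path below is just a placeholder):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Current behaviour: blank lines in the CSV input are skipped,
# which is what this ticket wants to make optional.
df_default = spark.read.option("header", True).csv("/tmp/data.csv")

# Proposed (hypothetical): keep the empty lines by switching the new option off.
# With the proposed default of true, today's behaviour is unchanged.
df_with_blanks = (
    spark.read
    .option("header", True)
    .option("ignoreEmptyLines", False)  # proposed option from this ticket
    .csv("/tmp/data.csv")
)
{code}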



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org