[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager
[ https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682259#comment-16682259 ]

Yuming Wang commented on SPARK-26000:
-------------------------------------

This is not a Spark issue. You may need to increase {{dfs.datanode.handler.count}}.


> Missing block when reading HDFS Data from Cloudera Manager
> ----------------------------------------------------------
>
>                 Key: SPARK-26000
>                 URL: https://issues.apache.org/jira/browse/SPARK-26000
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.2
>            Reporter: john
>            Priority: Major
>
> I am able to write to Cloudera Manager HDFS through open-source Spark, which
> runs separately, but I am not able to read the Cloudera Manager HDFS data back.
>
> I am getting missing block locations and socket timeouts.
>
> spark.read().textFile(path_to_file)


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
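For reference, {{dfs.datanode.handler.count}} is an HDFS DataNode setting (default 10, the number of server threads per DataNode), set in hdfs-site.xml; in Cloudera Manager it is exposed through the DataNode configuration page. A minimal sketch of the change being suggested; the value 30 is purely illustrative and not a recommendation from this thread:

```xml
<!-- hdfs-site.xml: raise the DataNode server thread count.
     30 is an illustrative value; tune it for your cluster size and load. -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>30</value>
</property>
```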
[jira] [Updated] (SPARK-26001) Reduce memory copy when writing decimal
[ https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-26001:
--------------------------------
    Affects Version/s:     (was: 2.5.0)
                       3.0.0


> Reduce memory copy when writing decimal
> ---------------------------------------
>
>                 Key: SPARK-26001
>                 URL: https://issues.apache.org/jira/browse/SPARK-26001
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: caoxuewen
>            Priority: Major
>
> This PR fixes two things:
> - When writing non-null decimals, we do not need to zero out all 16 allocated
>   bytes. If the number of bytes needed for a decimal is greater than 8, there
>   is no need to zero out bytes 0 through 7, because the first 8 bytes are
>   overwritten when the decimal is written.
> - When writing null decimals, we do not need to zero out all 16 allocated
>   bytes either. BitSetMethods.set marks the field as null and the decimal's
>   length is set to 0; when the decimal is read back, the 16 bytes of memory
>   are never accessed, so this is safe.
[jira] [Resolved] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25102.
-----------------------------
       Resolution: Fixed
         Assignee: Dongjoon Hyun
    Fix Version/s: 3.0.0


> Write Spark version to ORC/Parquet file metadata
> ------------------------------------------------
>
>                 Key: SPARK-25102
>                 URL: https://issues.apache.org/jira/browse/SPARK-25102
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zoltan Ivanfi
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 3.0.0
>
> Currently, Spark writes the Spark version number into Hive table properties
> under `spark.sql.create.version`:
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
>     "type":"struct",
>     "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761,
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write the Spark version to ORC/Parquet file metadata under
> `org.apache.spark.sql.create.version`. It is intentionally different from the
> Hive table property key `spark.sql.create.version`; it seems we cannot change
> that key, for backward compatibility (even in Apache Spark 3.0).
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:     file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator:  parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
> extra:    org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:    org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
[jira] [Assigned] (SPARK-26001) Reduce memory copy when writing decimal
[ https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26001:
------------------------------------
    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-26001) Reduce memory copy when writing decimal
[ https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682238#comment-16682238 ]

Apache Spark commented on SPARK-26001:
--------------------------------------

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22998
[jira] [Assigned] (SPARK-26001) Reduce memory copy when writing decimal
[ https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26001:
------------------------------------
    Assignee: Apache Spark
[jira] [Created] (SPARK-26001) Reduce memory copy when writing decimal
caoxuewen created SPARK-26001:
---------------------------------

             Summary: Reduce memory copy when writing decimal
                 Key: SPARK-26001
                 URL: https://issues.apache.org/jira/browse/SPARK-26001
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.5.0
            Reporter: caoxuewen


This PR fixes two things:
- When writing non-null decimals, we do not need to zero out all 16 allocated
  bytes. If the number of bytes needed for a decimal is greater than 8, there
  is no need to zero out bytes 0 through 7, because the first 8 bytes are
  overwritten when the decimal is written.
- When writing null decimals, we do not need to zero out all 16 allocated
  bytes either. BitSetMethods.set marks the field as null and the decimal's
  length is set to 0; when the decimal is read back, the 16 bytes of memory
  are never accessed, so this is safe.
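The zero-out optimization described above can be sketched in plain Python. This is an illustration of the idea only, not Spark's actual UnsafeRowWriter code; the 16-byte slot stands in for the space reserved for a decimal with precision > 18, and the helper name is made up for the sketch:

```python
# Illustrative sketch (not Spark code) of the "skip zeroing bytes 0..7"
# optimization for wide decimals written into a reused 16-byte slot.

def write_wide_decimal(buf: bytearray, offset: int, dec_bytes: bytes) -> None:
    """Write a wide decimal's bytes (length 9..16) into a 16-byte slot.

    Old behaviour: zero all 16 bytes, then copy the payload.
    Optimization:  the copy always overwrites bytes 0..7 (the payload is
    longer than 8 bytes), so only bytes 8..15 need to be zeroed first.
    """
    assert 8 < len(dec_bytes) <= 16
    # Zero only the upper half of the slot; the lower half is covered below.
    buf[offset + 8:offset + 16] = b"\x00" * 8
    # Copy the payload; this overwrites bytes 0..len(dec_bytes)-1.
    buf[offset:offset + len(dec_bytes)] = dec_bytes

buf = bytearray(b"\xff" * 32)            # simulate dirty, reused memory
write_wide_decimal(buf, 0, b"\x01" * 12) # a 12-byte decimal payload
assert buf[0:12] == b"\x01" * 12         # payload written
assert buf[12:16] == b"\x00" * 4         # tail zeroed: no garbage leaks
```

Note the safety argument is the same as in the issue description: no byte in the slot is ever left holding stale data, even though half the zeroing work is skipped.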
[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager
[ https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682163#comment-16682163 ]

john commented on SPARK-26000:
------------------------------

I have Cloudera Manager in environment A, which has the HDFS component, and
Spark in environment B. I am doing a very simple read and write to/from HDFS.
Writing to the Cloudera Manager HDFS works as expected, but when reading back
I get the errors below:

"java.lang.reflect.InvocationTargetException"
Caused by: "org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;"
Caused by: "java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/SparkNode_IP_PORT_NoO remote=/NameNode:50010:"

Sample code (Java):

// writing
spark.write().mode("append").format("parquet").save(path_to_file);

// read
spark.read().parquet(path_to_file);
[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager
[ https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682151#comment-16682151 ]

Yuming Wang commented on SPARK-26000:
-------------------------------------

Could you provide more information?
[jira] [Created] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager
john created SPARK-26000:
-----------------------------

             Summary: Missing block when reading HDFS Data from Cloudera Manager
                 Key: SPARK-26000
                 URL: https://issues.apache.org/jira/browse/SPARK-26000
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.2
            Reporter: john


I am able to write to Cloudera Manager HDFS through open-source Spark, which
runs separately, but I am not able to read the Cloudera Manager HDFS data back.

I am getting missing block locations and socket timeouts.
[jira] [Commented] (SPARK-25993) Add test cases for resolution of ORC table location
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682141#comment-16682141 ]

kevin yu commented on SPARK-25993:
----------------------------------

I am looking into it now. Kevin


> Add test cases for resolution of ORC table location
> ---------------------------------------------------
>
>                 Key: SPARK-25993
>                 URL: https://issues.apache.org/jira/browse/SPARK-25993
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.3.2
>            Reporter: Xiao Li
>            Priority: Major
>              Labels: starter
>
> Add a test case based on the following example. The behavior was changed in
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
>
> create external table tab1(folder int, number int, word string) STORED AS ORC LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}
[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682137#comment-16682137 ]

Dongjoon Hyun commented on SPARK-24229:
---------------------------------------

I also agree with [~vanzin]'s opinion. Since this is open, [~Fokko] already
made a PR for this. I'll close this for now. Please reopen this with a
reproducible test case.


> Upgrade to the latest Apache Thrift 0.10.0 release
> --------------------------------------------------
>
>                 Key: SPARK-24229
>                 URL: https://issues.apache.org/jira/browse/SPARK-24229
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.3.0
>            Reporter: Ray Donnelly
>            Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
> there are critical vulnerabilities in libthrift 0.9.3, currently vendored
> in Apache Spark (and then, for us, into PySpark).
>
> Can anyone help to assess the seriousness of this and what should be done
> about it?
[jira] [Updated] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager
[ https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

john updated SPARK-26000:
-------------------------
    Description:
I am able to write to Cloudera Manager HDFS through Open Source Spark which
runs separately. but not able to read the Cloudera Manger HDFS data.

I am getting missing block location, socketTimeOut.

spark.read().textfile(path_to_file)

  was:
I am able to write to Cloudera Manager HDFS through Open Source Spark which
runs separately. but not able to read the Cloudera Manger HDFS data.

I am getting missing block location, socketTimeOut.
[jira] [Comment Edited] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682137#comment-16682137 ]

Dongjoon Hyun edited comment on SPARK-24229 at 11/10/18 1:40 AM:
-----------------------------------------------------------------

I also agree with [~vanzin]'s opinion. Since this is open, [~Fokko] already
made a PR for this. I'll close this for now to save the Apache Spark
community effort. Please reopen this with a reproducible test case.

  was (Author: dongjoon):
I also agree with [~vanzin]'s opinion. Since this is open, [~Fokko] already
made a PR for this. I'll close this for now. Please reopen this with a
reproducible test case.
[jira] [Resolved] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-24229.
-----------------------------------
    Resolution: Not A Problem
[jira] [Assigned] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25999:
------------------------------------
    Assignee:     (was: Apache Spark)


> make-distribution.sh failure with --r and -Phadoop-provided
> -----------------------------------------------------------
>
>                 Key: SPARK-25999
>                 URL: https://issues.apache.org/jira/browse/SPARK-25999
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: shanyu zhao
>            Priority: Major
>         Attachments: SPARK-25999.patch
>
> It is not possible to build a distribution that does not contain the Hadoop
> dependencies but does include SparkR. This is because R/check_cran.sh builds
> the R documentation, which depends on the Hadoop jars in the
> assembly/target/scala-xxx/jars folder.
>
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682129#comment-16682129 ]

Apache Spark commented on SPARK-25999:
--------------------------------------

User 'shanyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22997
[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682126#comment-16682126 ]

Apache Spark commented on SPARK-25999:
--------------------------------------

User 'shanyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22997
[jira] [Assigned] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25999:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682122#comment-16682122 ]

Yuming Wang commented on SPARK-25999:
-------------------------------------

Please create a pull request at: https://github.com/apache/spark/pulls
[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682089#comment-16682089 ]

shanyu zhao commented on SPARK-25999:
-------------------------------------

Patch attached. Basically it creates an optional project that brings all
dependencies into the R/rjarsdep/target folder, and copies the missing jars
into the assembly/target folder before building R.
[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shanyu zhao updated SPARK-25999:
--------------------------------
    Attachment: SPARK-25999.patch
[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shanyu zhao updated SPARK-25999:
--------------------------------
    Summary: make-distribution.sh failure with --r and -Phadoop-provided
             (was: Spark make-distribution failure with --r and -Phadoop-provided)
[jira] [Created] (SPARK-25999) Spark make-distribution failure with --r and -Phadoop-provided
shanyu zhao created SPARK-25999:
-----------------------------------

             Summary: Spark make-distribution failure with --r and -Phadoop-provided
                 Key: SPARK-25999
                 URL: https://issues.apache.org/jira/browse/SPARK-25999
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 2.4.0, 2.3.2
            Reporter: shanyu zhao


It is not possible to build a distribution that does not contain the Hadoop
dependencies but does include SparkR. This is because R/check_cran.sh builds
the R documentation, which depends on the Hadoop jars in the
assembly/target/scala-xxx/jars folder.

To reproduce:

MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive -Psparkr -Phadoop-provided"
./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS

Error:

* creating vignettes ... ERROR
...
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
[jira] [Commented] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682023#comment-16682023 ]

Apache Spark commented on SPARK-25997:
--------------------------------------

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22996


> Python example code for Power Iteration Clustering in spark.ml
> --------------------------------------------------------------
>
>                 Key: SPARK-25997
>                 URL: https://issues.apache.org/jira/browse/SPARK-25997
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: Huaxin Gao
>            Priority: Minor
>
> Add a Python example for Power Iteration Clustering to the spark.ml examples.
[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25997: Assignee: Apache Spark > Python example code for Power Iteration Clustering in spark.ml > -- > > Key: SPARK-25997 > URL: https://issues.apache.org/jira/browse/SPARK-25997 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > Add a python example code for Power iteration clustering in spark.ml examples -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25997: Assignee: (was: Apache Spark) > Python example code for Power Iteration Clustering in spark.ml > -- > > Key: SPARK-25997 > URL: https://issues.apache.org/jira/browse/SPARK-25997 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > Add a python example code for Power iteration clustering in spark.ml examples -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
[ https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25998: Assignee: Apache Spark > TorrentBroadcast holds strong reference to broadcast object > --- > > Key: SPARK-25998 > URL: https://issues.apache.org/jira/browse/SPARK-25998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Brandon Krieger >Assignee: Apache Spark >Priority: Major > > If we do a large number of broadcast joins while holding onto the Dataset > reference, it will hold onto a large amount of memory for the value of the > broadcast object. The broadcast object is also held in the MemoryStore, but > that will clean itself up to prevent its memory usage from going over a > certain level. In my use case, I don't want to release the reference to the > Dataset (which would allow the broadcast object to be GCed) because I want to > be able to unpersist it at some point in the future (when it is no longer > relevant). > See the following repro in Spark shell: > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.SparkEnv > val startDf = (1 to 100).toDF("num").withColumn("num", > $"num".cast("string")).cache() > val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) > val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) > val broadcastJoinedDf = leftDf.join(broadcast(rightDf), > leftDf.col("num").eqNullSafe(rightDf.col("num"))) > broadcastJoinedDf.count > // Take a heap dump, see UnsafeHashedRelation with hard references in > MemoryStore and Dataset > // Force the MemoryStore to clear itself > SparkEnv.get.blockManager.stop > // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now > referenced only by the Dataset. > {code} > If we make the TorrentBroadcast hold a weak reference to the broadcast > object, the second heap dump will show nothing; the UnsafeHashedRelation has > been GCed. 
> Given that the broadcast object can be reloaded from the MemoryStore, it > seems like it would be alright to use a WeakReference instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
[ https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682017#comment-16682017 ] Apache Spark commented on SPARK-25998: -- User 'bkrieger' has created a pull request for this issue: https://github.com/apache/spark/pull/22995 > TorrentBroadcast holds strong reference to broadcast object > --- > > Key: SPARK-25998 > URL: https://issues.apache.org/jira/browse/SPARK-25998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Brandon Krieger >Priority: Major > > If we do a large number of broadcast joins while holding onto the Dataset > reference, it will hold onto a large amount of memory for the value of the > broadcast object. The broadcast object is also held in the MemoryStore, but > that will clean itself up to prevent its memory usage from going over a > certain level. In my use case, I don't want to release the reference to the > Dataset (which would allow the broadcast object to be GCed) because I want to > be able to unpersist it at some point in the future (when it is no longer > relevant). > See the following repro in Spark shell: > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.SparkEnv > val startDf = (1 to 100).toDF("num").withColumn("num", > $"num".cast("string")).cache() > val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) > val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) > val broadcastJoinedDf = leftDf.join(broadcast(rightDf), > leftDf.col("num").eqNullSafe(rightDf.col("num"))) > broadcastJoinedDf.count > // Take a heap dump, see UnsafeHashedRelation with hard references in > MemoryStore and Dataset > // Force the MemoryStore to clear itself > SparkEnv.get.blockManager.stop > // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now > referenced only by the Dataset. 
> {code} > If we make the TorrentBroadcast hold a weak reference to the broadcast > object, the second heap dump will show nothing; the UnsafeHashedRelation has > been GCed. > Given that the broadcast object can be reloaded from the MemoryStore, it > seems like it would be alright to use a WeakReference instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
[ https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25998: Assignee: (was: Apache Spark) > TorrentBroadcast holds strong reference to broadcast object > --- > > Key: SPARK-25998 > URL: https://issues.apache.org/jira/browse/SPARK-25998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Brandon Krieger >Priority: Major > > If we do a large number of broadcast joins while holding onto the Dataset > reference, it will hold onto a large amount of memory for the value of the > broadcast object. The broadcast object is also held in the MemoryStore, but > that will clean itself up to prevent its memory usage from going over a > certain level. In my use case, I don't want to release the reference to the > Dataset (which would allow the broadcast object to be GCed) because I want to > be able to unpersist it at some point in the future (when it is no longer > relevant). > See the following repro in Spark shell: > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.SparkEnv > val startDf = (1 to 100).toDF("num").withColumn("num", > $"num".cast("string")).cache() > val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) > val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) > val broadcastJoinedDf = leftDf.join(broadcast(rightDf), > leftDf.col("num").eqNullSafe(rightDf.col("num"))) > broadcastJoinedDf.count > // Take a heap dump, see UnsafeHashedRelation with hard references in > MemoryStore and Dataset > // Force the MemoryStore to clear itself > SparkEnv.get.blockManager.stop > // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now > referenced only by the Dataset. > {code} > If we make the TorrentBroadcast hold a weak reference to the broadcast > object, the second heap dump will show nothing; the UnsafeHashedRelation has > been GCed. 
> Given that the broadcast object can be reloaded from the MemoryStore, it > seems like it would be alright to use a WeakReference instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
Brandon Krieger created SPARK-25998: --- Summary: TorrentBroadcast holds strong reference to broadcast object Key: SPARK-25998 URL: https://issues.apache.org/jira/browse/SPARK-25998 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Brandon Krieger If we do a large number of broadcast joins while holding onto the Dataset reference, it will hold onto a large amount of memory for the value of the broadcast object. The broadcast object is also held in the MemoryStore, but that will clean itself up to prevent its memory usage from going over a certain level. In my use case, I don't want to release the reference to the Dataset (which would allow the broadcast object to be GCed) because I want to be able to unpersist it at some point in the future (when it is no longer relevant). See the following repro in Spark shell: {code:java} import org.apache.spark.sql.functions._ import org.apache.spark.SparkEnv val startDf = (1 to 100).toDF("num").withColumn("num", $"num".cast("string")).cache() val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) val broadcastJoinedDf = leftDf.join(broadcast(rightDf), leftDf.col("num").eqNullSafe(rightDf.col("num"))) broadcastJoinedDf.count // Take a heap dump, see UnsafeHashedRelation with hard references in MemoryStore and Dataset // Force the MemoryStore to clear itself SparkEnv.get.blockManager.stop // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now referenced only by the Dataset. {code} If we make the TorrentBroadcast hold a weak reference to the broadcast object, the second heap dump will show nothing; the UnsafeHashedRelation has been GCed. Given that the broadcast object can be reloaded from the MemoryStore, it seems like it would be alright to use a WeakReference instead. 
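The proposed fix rests on ordinary weak-reference semantics: a weakly referenced object can be reclaimed once its last strong reference is gone, after which the reference resolves to nothing, while the value can still be reloaded from elsewhere (here, the MemoryStore). A minimal Python sketch of that behavior follows; it is illustrative only, and `BroadcastValue` is a hypothetical stand-in, not Spark code:

```python
import gc
import weakref

class BroadcastValue:
    """Hypothetical stand-in for a large broadcast object (e.g. a hashed relation)."""
    def __init__(self, payload):
        self.payload = payload

# Holding a strong reference keeps the object alive indefinitely,
# which is the memory problem the ticket describes.
value = BroadcastValue(payload=list(range(1000)))
ref = weakref.ref(value)  # a weak reference does not keep the object alive

assert ref() is value  # still reachable while the strong reference exists

# Once the last strong reference is dropped, the object becomes collectable
# and the weak reference starts returning None.
del value
gc.collect()
assert ref() is None
```

This mirrors the heap-dump observation in the report: with a strong field the relation stays reachable through the Dataset, whereas a weak field would let it be reclaimed and later re-fetched.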
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
Huaxin Gao created SPARK-25997: -- Summary: Python example code for Power Iteration Clustering in spark.ml Key: SPARK-25997 URL: https://issues.apache.org/jira/browse/SPARK-25997 Project: Spark Issue Type: Documentation Components: ML Affects Versions: 3.0.0 Reporter: Huaxin Gao Add Python example code for Power Iteration Clustering to the spark.ml examples -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24101: - Assignee: Ilya Matiach > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Assignee: Ilya Matiach >Priority: Major > Fix For: 3.0.0 > > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24101. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 17086 [https://github.com/apache/spark/pull/17086] > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Assignee: Ilya Matiach >Priority: Major > Fix For: 3.0.0 > > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
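The gap this ticket closes can be illustrated without any framework: an unweighted metric gives every row the same vote, while a weighted metric scales each row's contribution by its sample weight, which is what an evaluator must do for weighted training to be reflected in model selection. A small sketch under that assumption (the function names are illustrative, not the Spark API):

```python
def accuracy(labels, preds):
    # Unweighted: every row contributes equally.
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

def weighted_accuracy(labels, preds, weights):
    # Weighted: each row contributes proportionally to its sample weight,
    # so a heavily weighted misclassified row drags the metric down more.
    total = sum(weights)
    correct = sum(w for y, p, w in zip(labels, preds, weights) if y == p)
    return correct / total

labels = [0, 1, 1, 2]
preds  = [0, 1, 2, 2]
print(accuracy(labels, preds))                          # 0.75
print(weighted_accuracy(labels, preds, [1, 1, 8, 1]))   # 3/11 ≈ 0.2727
```

An evaluator that ignores the weight column would report 0.75 in both cases, which is why CrossValidator could pick the wrong model for weighted data.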
[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps
[ https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Gómez updated SPARK-25996: -- Description: Hi all, When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: |ACCOUNTID|AMOUNT|TS|total_count| |1|100|2018-01-01 00:00:01|1| |1|1000|2018-01-01 10:00:01|1| |1|25|2018-01-01 10:00:02|2| |1|500|2018-01-01 10:00:03|3| |1|100|2018-01-01 10:00:04|4| |1|80|2018-01-01 10:00:05|5| |1|700|2018-01-01 11:00:04|1| |1|205|2018-01-02 10:00:02|1| |1|500|2018-01-02 10:00:03|2| |3|80|2018-01-02 10:00:05|1| As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you was: Hi there,
When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: |ACCOUNTID|AMOUNT|TS|total_count| |1|100|2018-01-01 00:00:01|1| |1|1000|2018-01-01 10:00:01|1| |1|25|2018-01-01 10:00:02|2| |1|500|2018-01-01 10:00:03|3| |1|100|2018-01-01 10:00:04|4| |1|80|2018-01-01 10:00:05|5| |1|700|2018-01-01 11:00:04|1| |1|205|2018-01-02 10:00:02|1| |1|500|2018-01-02 10:00:03|2| |3|80|2018-01-02 10:00:05|1| As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me?
Thank you very much > Aggregations do not return the correct values for rows with equal timestamps > -- > > Key: SPARK-25996 > URL: https://issues.apache.org/jira/browse/SPARK-25996 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1, 2.4.0 > Environment: Windows 10 > PyCharm 2018.2.2 > Python 3.6 > >Reporter: Ignacio Gómez >Priority: Major > > Hi all, > When using PySpark I count the records preceding the current row's timestamp, > including the current row itself, with the corresponding query: > query = """ > select *, count(*) over (partition by ACCOUNTID > order by TS > range between interval 5000 milliseconds preceding and current row) as > total_count > from df3 > """ > df3 = sqlContext.sql(query) > and it returns the following: > > |ACCOUNTID|AMOUNT|TS|total_count| > |1|100|2018-01-01 00:00:01|1| > |1|1000|2018-01-01 10:00:01|1| > |1|25|2018-01-01 10:00:02|2| > |1|500|2018-01-01 10:00:03|3| > |1|100|2018-01-01 10:00:04|4| > |1|80|2018-01-01 10:00:05|5| > |1|700|2018-01-01 11:00:04|1| > |1|205|2018-01-02 10:00:02|1| > |1|500|2018-01-02 10:00:03|2| > |3|80|2018-01-02 10:00:05|1| > > As you can see, in the third row the total_count should be 3 instead of 2, > because there are 2 preceding records, not 1. The error carries over into the > following rows. > This happens with the other aggregation operations as well. > Even though the first rows share the same date, both of those rows still > exist; it should not be treated as if the only row that exists is the last > one with that date. > > Could you help me? > Thank you -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
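As context for the report above, a RANGE frame is time-based rather than row-based: a row is counted only when its timestamp falls inside the 5000 ms preceding the current row's timestamp, so rows that merely come earlier in the partition are not automatically included. A pure-Python sketch of that semantics over the account-1 timestamps (not Spark code):

```python
from datetime import datetime, timedelta

# The TS values for ACCOUNTID = 1 from the table above.
ts = [datetime(2018, 1, 1, 0, 0, 1),
      datetime(2018, 1, 1, 10, 0, 1),
      datetime(2018, 1, 1, 10, 0, 2),
      datetime(2018, 1, 1, 10, 0, 3),
      datetime(2018, 1, 1, 10, 0, 4),
      datetime(2018, 1, 1, 10, 0, 5),
      datetime(2018, 1, 1, 11, 0, 4),
      datetime(2018, 1, 2, 10, 0, 2),
      datetime(2018, 1, 2, 10, 0, 3)]

window = timedelta(milliseconds=5000)
# Count rows whose timestamp lies in [current - 5s, current].
counts = [sum(1 for other in ts if current - window <= other <= current)
          for current in ts]
print(counts)  # [1, 1, 2, 3, 4, 5, 1, 1, 2]
```

These counts match the total_count column reported above; note that the first timestamp (00:00:01) lies hours outside the third row's five-second frame, which is consistent with the value 2 on that row.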
[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps
[ https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Gómez updated SPARK-25996: -- Description: Hi there, When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: |ACCOUNTID|AMOUNT|TS|total_count| |1|100|2018-01-01 00:00:01|1| |1|1000|2018-01-01 10:00:01|1| |1|25|2018-01-01 10:00:02|2| |1|500|2018-01-01 10:00:03|3| |1|100|2018-01-01 10:00:04|4| |1|80|2018-01-01 10:00:05|5| |1|700|2018-01-01 11:00:04|1| |1|205|2018-01-02 10:00:02|1| |1|500|2018-01-02 10:00:03|2| |3|80|2018-01-02 10:00:05|1| As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you very much was: Hi there,
When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: |ACCOUNTID|AMOUNT|TS|total_count| |1|100|2018-01-01 00:00:01|1| |1|1000|2018-01-01 10:00:01|1| |1|25|2018-01-01 10:00:02|2| |1|500|2018-01-01 10:00:03|3| |1|100|2018-01-01 10:00:04|4| |1|80|2018-01-01 10:00:05|5| |1|700|2018-01-01 11:00:04|1| |1|205|2018-01-02 10:00:02|1| |1|500|2018-01-02 10:00:03|2| |3|80|2018-01-02 10:00:05|1| As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you very much > Aggregations do not return the correct values for rows with equal timestamps > -- > > Key: SPARK-25996 > URL: https://issues.apache.org/jira/browse/SPARK-25996 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1, 2.4.0 > Environment: Windows 10 > PyCharm 2018.2.2 > Python 3.6 > >Reporter: Ignacio Gómez >Priority: Major > > Hi there,
> When using PySpark I count the records preceding the current row's timestamp, > including the current row itself, with the corresponding query: > query = """ > select *, count(*) over (partition by ACCOUNTID > order by TS > range between interval 5000 milliseconds preceding and current row) as > total_count > from df3 > """ > df3 = sqlContext.sql(query) > and it returns the following: > > |ACCOUNTID|AMOUNT|TS|total_count| > |1|100|2018-01-01 00:00:01|1| > |1|1000|2018-01-01 10:00:01|1| > |1|25|2018-01-01 10:00:02|2| > |1|500|2018-01-01 10:00:03|3| > |1|100|2018-01-01 10:00:04|4| > |1|80|2018-01-01 10:00:05|5| > |1|700|2018-01-01 11:00:04|1| > |1|205|2018-01-02 10:00:02|1| > |1|500|2018-01-02 10:00:03|2| > |3|80|2018-01-02 10:00:05|1| > As you can see, in the third row the total_count should be 3 instead of 2, > because there are 2 preceding records, not 1. The error carries over into the > following rows. > This happens with the other aggregation operations as well. > Even though the first rows share the same date, both of those rows still > exist; it should not be treated as if the only row that exists is the last > one with that date. > > Could you help me? > Thank you very much -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps
[ https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Gómez updated SPARK-25996: -- Description: Hi there, When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: |ACCOUNTID|AMOUNT|TS|total_count| |1|100|2018-01-01 00:00:01|1| |1|1000|2018-01-01 10:00:01|1| |1|25|2018-01-01 10:00:02|2| |1|500|2018-01-01 10:00:03|3| |1|100|2018-01-01 10:00:04|4| |1|80|2018-01-01 10:00:05|5| |1|700|2018-01-01 11:00:04|1| |1|205|2018-01-02 10:00:02|1| |1|500|2018-01-02 10:00:03|2| |3|80|2018-01-02 10:00:05|1| As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you very much was: Hi there, When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: !image-2018-11-09-18-25-55-296.png!
As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you very much > Aggregations do not return the correct values for rows with equal timestamps > -- > > Key: SPARK-25996 > URL: https://issues.apache.org/jira/browse/SPARK-25996 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1, 2.4.0 > Environment: Windows 10 > PyCharm 2018.2.2 > Python 3.6 > >Reporter: Ignacio Gómez >Priority: Major > > Hi there, > When using PySpark I count the records preceding the current row's timestamp, > including the current row itself, with the corresponding query: > query = """ > select *, count(*) over (partition by ACCOUNTID > order by TS > range between interval 5000 milliseconds preceding and current row) as > total_count > from df3 > """ > df3 = sqlContext.sql(query) > and it returns the following: > |ACCOUNTID|AMOUNT|TS|total_count| > |1|100|2018-01-01 00:00:01|1| > |1|1000|2018-01-01 10:00:01|1| > |1|25|2018-01-01 10:00:02|2| > |1|500|2018-01-01 10:00:03|3| > |1|100|2018-01-01 10:00:04|4| > |1|80|2018-01-01 10:00:05|5| > |1|700|2018-01-01 11:00:04|1| > |1|205|2018-01-02 10:00:02|1| > |1|500|2018-01-02 10:00:03|2| > |3|80|2018-01-02 10:00:05|1| > As you can see, in the third row the total_count should be 3 instead of 2, > because there are 2 preceding records, not 1. The error carries over into the > following rows. > This happens with the other aggregation operations as well.
> Even though the first rows share the same date, both of those rows still > exist; it should not be treated as if the only row that exists is the last > one with that date. > > Could you help me? > Thank you very much -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps
Ignacio Gómez created SPARK-25996: - Summary: Aggregations do not return the correct values for rows with equal timestamps Key: SPARK-25996 URL: https://issues.apache.org/jira/browse/SPARK-25996 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0, 2.3.1 Environment: Windows 10 PyCharm 2018.2.2 Python 3.6 Reporter: Ignacio Gómez Hi there, When using PySpark I count the records preceding the current row's timestamp, including the current row itself, with the corresponding query: query = """ select *, count(*) over (partition by ACCOUNTID order by TS range between interval 5000 milliseconds preceding and current row) as total_count from df3 """ df3 = sqlContext.sql(query) and it returns the following: !image-2018-11-09-18-25-55-296.png! As you can see, in the third row the total_count should be 3 instead of 2, because there are 2 preceding records, not 1. The error carries over into the following rows. This happens with the other aggregation operations as well. Even though the first rows share the same date, both of those rows still exist; it should not be treated as if the only row that exists is the last one with that date. Could you help me? Thank you very much -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895 ] John Bauer edited comment on SPARK-21542 at 11/9/18 8:07 PM: - Compared to the previous, the above example is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: {code:java} impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("impute_model") impm.explainParams(){code} was (Author: johnhbauer): This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: {code:java} impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("impute_model") impm.explainParams(){code} > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini >Assignee: Ajay Saini >Priority: Major > Fix For: 2.3.0 > > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895 ] John Bauer edited comment on SPARK-21542 at 11/9/18 7:56 PM: - This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: {code:java} impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("impute_model") impm.explainParams(){code} was (Author: johnhbauer): This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: {code:java} impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("imputer_model") impm = ImputeNormalModel.load("impute_model") impm.getInputCol() impm.getOutputCol() impm.getMean() impm.getStddev(){code} > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini >Assignee: Ajay Saini >Priority: Major > Fix For: 2.3.0 > > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895 ] John Bauer commented on SPARK-21542: This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("imputer_model") impm = ImputeNormalModel.load("impute_model") impm.getInputCol() impm.getOutputCol() impm.getMean() impm.getStddev() > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini >Assignee: Ajay Saini >Priority: Major > Fix For: 2.3.0 > > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895 ] John Bauer edited comment on SPARK-21542 at 11/9/18 7:54 PM: - This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: {code:java} impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("imputer_model") impm = ImputeNormalModel.load("impute_model") impm.getInputCol() impm.getOutputCol() impm.getMean() impm.getStddev(){code} was (Author: johnhbauer): This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example: impute.write().save("impute") imp = ImputeNormal.load("impute") imp.explainParams() impute_model.write().save("impute_model") impm = ImputeNormalModel.load("imputer_model") impm = ImputeNormalModel.load("impute_model") impm.getInputCol() impm.getOutputCol() impm.getMean() impm.getStddev() > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini >Assignee: Ajay Saini >Priority: Major > Fix For: 2.3.0 > > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()
[ https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681892#comment-16681892 ] Ruslan Dautkhanov commented on SPARK-25958: --- Yep, the pyspark job completes fine after we removed ipv6 references in /etc/hosts. Thank you both > error: [Errno 97] Address family not supported by protocol in dataframe.take() > -- > > Key: SPARK-25958 > URL: https://issues.apache.org/jira/browse/SPARK-25958 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.3.1, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > > Following error happens on a heavy Spark job after 4 hours of runtime. > {code} > 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: > [Errno 97] Address family not supported by protocol > Traceback (most recent call last): > File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault > item.create_persistent_data() > File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", > line 53, in create_persistent_data > single_obj.create_persistent_data() > File > "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line > 21, in create_persistent_data > main_df = self.generate_dataframe_main() > File > "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line > 98, in generate_dataframe_main > raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation() > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 16, in get_raw_data_with_metadata_and_aggregation > main_df = > self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df) > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe > return_df = self.get_dataframe_from_binary_value_iteration(input_df) > File > 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 136, in get_dataframe_from_binary_value_iteration > combine_df = self.get_dataframe_from_binary_value(input_df=input_df, > binary_value=count) > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 154, in get_dataframe_from_binary_value > if len(results_of_filter_df.take(1)) == 0: > File > "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", > line 504, in take > return self.limit(num).collect() > File > "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", > line 467, in collect > return list(_load_from_socket(sock_info, > BatchedSerializer(PickleSerializer( > File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line > 148, in _load_from_socket > sock = socket.socket(af, socktype, proto) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in > __init__ > _sock = _realsocket(family, type, proto) > error: [Errno 97] Address family not supported by protocol > {code} > Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148: > {code} > def _load_from_socket(sock_info, serializer): > port, auth_secret = sock_info > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): > af, socktype, proto, canonname, sa = res > sock = socket.socket(af, socktype, proto) > try: > sock.settimeout(15) > sock.connect(sa) > except socket.error: > sock.close() > sock = None > continue > break > if not sock: > raise Exception("could not open socket") > # The RDD materialization time is unpredicable, if we set a timeout for > socket reading > # operation, it will very possibly fail. See SPARK-18281. 
> sock.settimeout(None) > sockfile = sock.makefile("rwb", 65536) > do_server_auth(sockfile, auth_secret) > # The socket will be automatically closed when garbage-collected. > return serializer.load_stream(sockfile) > {code} > the culprit is in lib/spark2/python/pyspark/rdd.py in this line > {code} > socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM) > {code} > so the error "error: [Errno 97] *Address family* not supported by protocol" > seems to be caused by the socket.AF_UNSPEC third option to the > socket.getaddrinfo() call. > I tried to call similar socket.getaddrinfo call locally outside
[jira] [Resolved] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()
[ https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov resolved SPARK-25958. --- Resolution: Not A Problem > error: [Errno 97] Address family not supported by protocol in dataframe.take() > -- > > Key: SPARK-25958 > URL: https://issues.apache.org/jira/browse/SPARK-25958 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.3.1, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > > Following error happens on a heavy Spark job after 4 hours of runtime.. > {code} > 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: > [Errno 97] Address family not supported by protocol > Traceback (most recent call last): > File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault > item.create_persistent_data() > File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", > line 53, in create_persistent_data > single_obj.create_persistent_data() > File > "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line > 21, in create_persistent_data > main_df = self.generate_dataframe_main() > File > "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line > 98, in generate_dataframe_main > raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation() > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 16, in get_raw_data_with_metadata_and_aggregation > main_df = > self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df) > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe > return_df = self.get_dataframe_from_binary_value_iteration(input_df) > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 136, in 
get_dataframe_from_binary_value_iteration > combine_df = self.get_dataframe_from_binary_value(input_df=input_df, > binary_value=count) > File > "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", > line 154, in get_dataframe_from_binary_value > if len(results_of_filter_df.take(1)) == 0: > File > "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", > line 504, in take > return self.limit(num).collect() > File > "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", > line 467, in collect > return list(_load_from_socket(sock_info, > BatchedSerializer(PickleSerializer( > File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line > 148, in _load_from_socket > sock = socket.socket(af, socktype, proto) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in > __init__ > _sock = _realsocket(family, type, proto) > error: [Errno 97] Address family not supported by protocol > {code} > Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148: > {code} > def _load_from_socket(sock_info, serializer): > port, auth_secret = sock_info > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): > af, socktype, proto, canonname, sa = res > sock = socket.socket(af, socktype, proto) > try: > sock.settimeout(15) > sock.connect(sa) > except socket.error: > sock.close() > sock = None > continue > break > if not sock: > raise Exception("could not open socket") > # The RDD materialization time is unpredicable, if we set a timeout for > socket reading > # operation, it will very possibly fail. See SPARK-18281. > sock.settimeout(None) > sockfile = sock.makefile("rwb", 65536) > do_server_auth(sockfile, auth_secret) > # The socket will be automatically closed when garbage-collected. 
> return serializer.load_stream(sockfile) > {code} > the culprit is in lib/spark2/python/pyspark/rdd.py in this line > {code} > socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM) > {code} > so the error "error: [Errno 97] *Address family* not supported by protocol" > seems to be caused by the socket.AF_UNSPEC third option to the > socket.getaddrinfo() call. > I tried to call similar socket.getaddrinfo call locally outside of PySpark > and it worked fine. > RHEL 7.5. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
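The diagnosis above can be checked outside of Spark with a short stand-alone probe (a sketch, not Spark code; the port number is arbitrary, since getaddrinfo only resolves and never connects):

```python
import socket

# Probe which address families "localhost" resolves to, mimicking the
# getaddrinfo call in pyspark's _load_from_socket. On hosts whose
# /etc/hosts maps localhost to a stale or unsupported IPv6 entry, the
# subsequent socket.socket(af, ...) call is what raises
# "[Errno 97] Address family not supported by protocol".
results = socket.getaddrinfo("localhost", 15002, socket.AF_UNSPEC, socket.SOCK_STREAM)
families = [res[0] for res in results]
print(families)
```

If AF_INET6 shows up here but creating an AF_INET6 socket fails, removing the IPv6 entries from /etc/hosts (as was done above) is a reasonable workaround.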
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681891#comment-16681891 ] John Bauer commented on SPARK-21542: 
{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, randn
from pyspark import keyword_only
from pyspark.ml import Estimator, Model
#from pyspark.ml.feature import SQLTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

spark = SparkSession\
    .builder\
    .appName("ImputeNormal")\
    .getOrCreate()

class ImputeNormal(Estimator, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable,
                   ):
    @keyword_only
    def __init__(self, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormal, self).__init__()
        self._setDefault(inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _fit(self, data):
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()
        stats = data.select(inputCol).describe()
        mean = stats.where(col("summary") == "mean").take(1)[0][inputCol]
        stddev = stats.where(col("summary") == "stddev").take(1)[0][inputCol]
        return ImputeNormalModel(mean=float(mean),
                                 stddev=float(stddev),
                                 inputCol=inputCol,
                                 outputCol=outputCol,
                                 )
        # FOR A TRULY MINIMAL BUT LESS DIDACTICALLY EFFECTIVE DEMO, DO INSTEAD:
        #sql_text = "SELECT *, IF({inputCol} IS NULL, {stddev} * randn() + {mean}, {inputCol}) AS {outputCol} FROM __THIS__"
        #
        #return SQLTransformer(statement=sql_text.format(stddev=stddev, mean=mean, inputCol=inputCol, outputCol=outputCol))

class ImputeNormalModel(Model, HasInputCol, HasOutputCol,
                        DefaultParamsReadable, DefaultParamsWritable,
                        ):
    mean = Param(Params._dummy(), "mean",
                 "Mean value of imputations. Calculated by fit method.",
                 typeConverter=TypeConverters.toFloat)
    stddev = Param(Params._dummy(), "stddev",
                   "Standard deviation of imputations. Calculated by fit method.",
                   typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormalModel, self).__init__()
        self._setDefault(mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def getMean(self):
        return self.getOrDefault(self.mean)

    def setMean(self, mean):
        self._set(mean=mean)

    def getStddev(self):
        return self.getOrDefault(self.stddev)

    def setStddev(self, stddev):
        self._set(stddev=stddev)

    def _transform(self, data):
        mean = self.getMean()
        stddev = self.getStddev()
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()
        df = data.withColumn(outputCol,
                             when(col(inputCol).isNull(),
                                  stddev * randn() + mean).\
                             otherwise(col(inputCol)))
        return df

if __name__ == "__main__":
    train = spark.createDataFrame([[0], [1], [2]] + [[None]] * 100, ['input'])
    impute = ImputeNormal(inputCol='input', outputCol='output')
    impute_model = impute.fit(train)
    print("Input column: {}".format(impute_model.getInputCol()))
    print("Output column: {}".format(impute_model.getOutputCol()))
    print("Mean: {}".format(impute_model.getMean()))
    print("Standard Deviation: {}".format(impute_model.getStddev()))
    test = impute_model.transform(train)
    test.show(10)
    test.describe().show()
    print("mean and stddev for outputCol should be close to those of inputCol")
{code}
> Helper functions for custom Python Persistence >
[jira] [Created] (SPARK-25995) sparkR should ensure user args are after the argument used for the port
Thomas Graves created SPARK-25995: - Summary: sparkR should ensure user args are after the argument used for the port Key: SPARK-25995 URL: https://issues.apache.org/jira/browse/SPARK-25995 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.2 Reporter: Thomas Graves Currently if you run sparkR and accidentally specify an argument, it fails with a useless error message. For example: $SPARK_HOME/bin/sparkR --master yarn --deploy-mode client fooarg This gets turned into: Launching java with spark-submit command spark-submit "--master" "yarn" "--deploy-mode" "client" "sparkr-shell" "fooarg" /tmp/Rtmp6XBGz2/backend_port162806ea36bca Notice that "fooarg" got put before /tmp file which is how R and jvm know which port to connect to. SparkR eventually fails with timeout exception after 10 seconds. SparkR should either not allow args or make sure the order is correct so the backend_port is always first. see https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L129 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
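The ordering bug described above comes down to where user arguments are spliced into the launch command. A minimal Python sketch of the intended fix (a hypothetical helper, not the actual sparkR launcher code in sparkR.R):

```python
def build_sparkr_command(submit_args, backend_port_file, user_args):
    # The backend port file must directly follow "sparkr-shell" so R and
    # the JVM agree on which port to connect to; any user-supplied
    # arguments must come after it, never before.
    return (["spark-submit"] + list(submit_args)
            + ["sparkr-shell", backend_port_file] + list(user_args))

cmd = build_sparkr_command(
    ["--master", "yarn", "--deploy-mode", "client"],
    "/tmp/Rtmp6XBGz2/backend_port162806ea36bca",
    ["fooarg"])
print(cmd)
```

With this ordering, "fooarg" can no longer displace the backend_port path, so the backend does not time out waiting for a connection.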
[jira] [Created] (SPARK-25994) SPIP: DataFrame-based graph queries and algorithms
Xiangrui Meng created SPARK-25994: - Summary: SPIP: DataFrame-based graph queries and algorithms Key: SPARK-25994 URL: https://issues.apache.org/jira/browse/SPARK-25994 Project: Spark Issue Type: New Feature Components: GraphX Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng [placeholder] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681786#comment-16681786 ] Maxim Gekk commented on SPARK-24244: > is this new option available in PySpark too? Yes, it is as well as in R and Scala/Java. > Parse only required columns of CSV file > --- > > Key: SPARK-24244 > URL: https://issues.apache.org/jira/browse/SPARK-24244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > uniVocity parser allows to specify only required column names or indexes for > parsing like: > {code} > // Here we select only the columns by their indexes. > // The parser just skips the values in other columns > parserSettings.selectIndexes(4, 0, 1); > CsvParser parser = new CsvParser(parserSettings); > {code} > Need to modify *UnivocityParser* to extract only needed columns from > requiredSchema -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
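For readers without the uniVocity API at hand, the column-selection idea can be illustrated in plain Python (a hypothetical stand-in, not the Spark or uniVocity implementation; uniVocity additionally skips parsing the unselected values rather than discarding them afterwards):

```python
import csv
import io

def parse_selected(csv_text, indexes):
    # Keep only the columns at the given indexes from each record,
    # analogous in spirit to parserSettings.selectIndexes(4, 0, 1).
    return [[record[i] for i in indexes]
            for record in csv.reader(io.StringIO(csv_text))]

data = "id,name,score\n1,ann,9\n2,bob,7\n"
print(parse_selected(data, [2, 0]))  # → [['score', 'id'], ['9', '1'], ['7', '2']]
```

In Spark, the analogous pruning happens when the requiredSchema passed to UnivocityParser is narrower than the full CSV schema.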
[jira] [Created] (SPARK-25993) Add test cases for resolution of ORC table location
Xiao Li created SPARK-25993: --- Summary: Add test cases for resolution of ORC table location Key: SPARK-25993 URL: https://issues.apache.org/jira/browse/SPARK-25993 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.3.2 Reporter: Xiao Li Add a test case based on the following example. The behavior was changed in 2.3 release. We also need to upgrade the migration guide. {code:java} val someDF1 = Seq( (1, 1, "blah"), (1, 2, "blahblah") ).toDF("folder", "number", "word").repartition(1) someDF1.write.orc("/tmp/orctab1/dir1/") someDF1.write.orc("/mnt/orctab1/dir2/") create external table tab1(folder int,number int,word string) STORED AS ORC LOCATION '/tmp/orctab1/"); select * from tab1; create external table tab2(folder int,number int,word string) STORED AS ORC LOCATION '/tmp/orctab1/*"); select * from tab2; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25993) Add test cases for resolution of ORC table location
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25993: Labels: starter (was: ) > Add test cases for resolution of ORC table location > --- > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Priority: Major > Labels: starter > > Add a test case based on the following example. The behavior was changed in > 2.3 release. We also need to upgrade the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/mnt/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/"); > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*"); > select * from tab2; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25979) Window function: allow parentheses around window reference
[ https://issues.apache.org/jira/browse/SPARK-25979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25979. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 3.0.0 2.4.1 > Window function: allow parentheses around window reference > -- > > Key: SPARK-25979 > URL: https://issues.apache.org/jira/browse/SPARK-25979 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > Very minor parser bug, but possibly problematic for code-generated queries: > Consider the following two queries: > {code} > SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER > BY 1 > {code} > and > {code} > SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY > 1 > {code} > The former, with parens around the OVER condition, fails to parse while the > latter, without parens, succeeds: > {code} > Error in SQL statement: ParseException: > mismatched input '(' expecting {, ',', 'FROM', 'WHERE', 'GROUP', > 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19) > == SQL == > SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER > BY 1 > ---^^^ > {code} > This was found when running the cockroach DB tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25988) Keep names unchanged when deduplicating the column names in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-25988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25988. - Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 > Keep names unchanged when deduplicating the column names in Analyzer > > > Key: SPARK-25988 > URL: https://issues.apache.org/jira/browse/SPARK-25988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > {code} > withTempView("tmpView1", "tmpView2") { > withTable("tab1", "tab2") { > sql( > """ > |CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE) > |USING CSV > |PARTITIONED BY (TDATE) > """.stripMargin) > spark.table("tab1").where("TDATE >= > '2017-08-15'").createOrReplaceTempView("tmpView1") > sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet") > sql( > """ > |CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS > |SELECT N.tdate, col1 AS aliasCol1 > |FROM tmpView1 N > |JOIN tab2 Z > |ON N.tdate = Z.tdate > """.stripMargin) > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") { > sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = > y.tdate").collect() > } > } > } > {code} > The above code will issue the following error. 
> {code} > Expected only partition pruning predicates: > ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) >= > 2017-08-15)); > org.apache.spark.sql.AnalysisException: Expected only partition pruning > predicates: ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) > >= 2017-08-15)); > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146) > at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:958) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at >
[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681735#comment-16681735 ] Ruslan Dautkhanov commented on SPARK-24244: --- [~maxgekk] great improvement is this new option available in PySpark too? > Parse only required columns of CSV file > --- > > Key: SPARK-24244 > URL: https://issues.apache.org/jira/browse/SPARK-24244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > uniVocity parser allows to specify only required column names or indexes for > parsing like: > {code} > // Here we select only the columns by their indexes. > // The parser just skips the values in other columns > parserSettings.selectIndexes(4, 0, 1); > CsvParser parser = new CsvParser(parserSettings); > {code} > Need to modify *UnivocityParser* to extract only needed columns from > requiredSchema -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24421: Assignee: (was: Apache Spark) > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681722#comment-16681722 ] Apache Spark commented on SPARK-24421: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/22993 > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24421: -- Labels: release-notes (was: ) > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24421: Assignee: Apache Spark > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
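For context on the issue description above: the module declaration only matters for modular builds; on the classpath (how Spark actually runs) sun.misc.Unsafe remains reachable because the jdk.unsupported module is resolved by default. A minimal sketch, not from the issue itself, of the usual reflective grab of the singleton (class and method names here are illustrative):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeDemo {
    // Unsafe.getUnsafe() rejects ordinary application callers, so the
    // common workaround reflects on the private singleton field. This
    // compiles on the classpath without any module declaration because
    // jdk.unsupported is resolved by default on JDK 9+.
    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Write and read back one off-heap long, freeing the allocation after.
    static long roundTrip(long value) {
        Unsafe u = getUnsafe();
        long addr = u.allocateMemory(8);
        try {
            u.putLong(addr, value);
            return u.getLong(addr);
        } finally {
            u.freeMemory(addr);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L)); // 42
    }
}
```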
[jira] [Commented] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
[ https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681712#comment-16681712 ] Julia commented on SPARK-23814: --- [~hyukjin.kwon] Still got the same error with Spark 2.3.1. > Couldn't read file with colon in name and new line character in one of the > field. > - > > Key: SPARK-23814 > URL: https://issues.apache.org/jira/browse/SPARK-23814 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 2.2.0 >Reporter: bharath kumar avusherla >Priority: Major > > When the file name has a colon and the data contains a new line character, > reading with spark.read.option("multiLine","true").csv("s3n://DirectoryPath/") > throws a *"java.lang.IllegalArgumentException: > java.net.URISyntaxException: Relative path in absolute URI: > 2017-08-01T00:00:00Z.csv.gz"* error. If we remove > option("multiLine","true"), it works just fine even though the file name has > a colon in it. It also works fine if I apply *option("multiLine","true")* > to any other file which doesn't have a colon in it. But when both are > present (colon in the file name and a new line in the data), it fails. 
> {quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: > Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz > at org.apache.hadoop.fs.Path.initialize(Path.java:205) > at org.apache.hadoop.fs.Path.(Path.java:171) > at org.apache.hadoop.fs.Path.(Path.java:93) > at org.apache.hadoop.fs.Globber.glob(Globber.java:253) > at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51) > at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.take(RDD.scala:1327) > at > org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) > ... 48 elided > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > 2017-08-01T00:00:00Z.csv.gz > at java.net.URI.checkPath(URI.java:1823) > at java.net.URI.(URI.java:745) > at org.apache.hadoop.fs.Path.initialize(Path.java:202) > ... 86 more > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To
[jira] [Commented] (SPARK-25696) The storage memory displayed on spark Application UI is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-25696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681500#comment-16681500 ] Sean Owen commented on SPARK-25696: --- Per the pull request -- the error is actually slightly different. Yes 1024 should be the factor, but, all the units need to be displayed as kibibytes, etc. KiB, GiB and so on. Just changing the 1000 is wrong. > The storage memory displayed on spark Application UI is incorrect. > -- > > Key: SPARK-25696 > URL: https://issues.apache.org/jira/browse/SPARK-25696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Major > > In the reported heartbeat information, the unit of the memory data is bytes, > which is converted by the formatBytes() function in the utils.js file before > being displayed in the interface. The cardinality of the unit conversion in > the formatBytes function is 1000, which should be 1024. > function formatBytes(bytes, type) > { if (type !== 'display') return bytes; if (bytes == 0) return '0.0 B'; > var k = 1000; var dm = 1; var sizes = ['B', 'KB', 'MB', 'GB', 'TB', > 'PB', 'EB', 'ZB', 'YB']; var i = Math.floor(Math.log(bytes) / > Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + > sizes[i]; } > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
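A sketch, not the actual patch, of what Sean Owen's comment implies the corrected conversion looks like: factor 1024 together with the IEC unit names (KiB, MiB, ...). Written in Java rather than the utils.js original; class and method names are illustrative only.

```java
import java.util.Locale;

public class FormatBytes {
    // Binary conversion: divide by powers of 1024 and label with IEC
    // units, so "1.0 MiB" instead of the misleading "1.0 MB".
    static String formatBytes(double bytes) {
        if (bytes == 0) return "0.0 B";
        final String[] sizes = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
        int i = (int) Math.floor(Math.log(bytes) / Math.log(1024));
        return String.format(Locale.ROOT, "%.1f %s",
                bytes / Math.pow(1024, i), sizes[i]);
    }

    public static void main(String[] args) {
        System.out.println(formatBytes(1048576)); // 1.0 MiB
        System.out.println(formatBytes(1536));    // 1.5 KiB
    }
}
```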
[jira] [Resolved] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25973. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22982 [https://github.com/apache/spark/pull/22982] > Spark History Main page performance improvement > --- > > Key: SPARK-25973 > URL: https://issues.apache.org/jira/browse/SPARK-25973 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2 >Reporter: William Montaz >Assignee: William Montaz >Priority: Minor > Fix For: 3.0.0 > > Attachments: fix.patch > > > HistoryPage.scala counts applications (with a predicate depending on if it is > displaying incomplete or complete applications) to check if it must display > the dataTable. > Since it only checks if allAppsSize > 0, we could use exists method on the > iterator. This way we stop iterating at the first occurence found. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25973: - Assignee: William Montaz > Spark History Main page performance improvement > --- > > Key: SPARK-25973 > URL: https://issues.apache.org/jira/browse/SPARK-25973 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2 >Reporter: William Montaz >Assignee: William Montaz >Priority: Minor > Attachments: fix.patch > > > HistoryPage.scala counts applications (with a predicate depending on if it is > displaying incomplete or complete applications) to check if it must display > the dataTable. > Since it only checks if allAppsSize > 0, we could use exists method on the > iterator. This way we stop iterating at the first occurence found. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
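The improvement above swaps a full count for a short-circuiting existence check. A small illustration, not from the patch, of why that matters; it uses Java streams (`anyMatch` plays the role of Scala's `exists`), and the `probesFor*` names are made up:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ExistsVsCount {
    // How many elements does each approach actually inspect?
    // anyMatch stops at the first element satisfying the predicate.
    static int probesForAnyMatch(List<Integer> apps) {
        AtomicInteger probes = new AtomicInteger();
        apps.stream().anyMatch(a -> { probes.incrementAndGet(); return a > 0; });
        return probes.get();
    }

    // filter(...).count() has to evaluate the predicate on every element.
    static int probesForCount(List<Integer> apps) {
        AtomicInteger probes = new AtomicInteger();
        apps.stream().filter(a -> { probes.incrementAndGet(); return a > 0; }).count();
        return probes.get();
    }

    public static void main(String[] args) {
        List<Integer> apps = List.of(1, 2, 3, 4, 5);
        System.out.println(probesForAnyMatch(apps)); // 1: stops at first match
        System.out.println(probesForCount(apps));    // 5: scans everything
    }
}
```

When HistoryPage only needs "is there at least one application to display?", stopping at the first hit is all the work required.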
[jira] [Created] (SPARK-25992) Accumulators giving KeyError in pyspark
Abdeali Kothari created SPARK-25992: --- Summary: Accumulators giving KeyError in pyspark Key: SPARK-25992 URL: https://issues.apache.org/jira/browse/SPARK-25992 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: Abdeali Kothari I am using accumulators and when I run my code, I sometimes get some warn messages. When I checked, there was nothing accumulated - not sure if I lost info from the accumulator or it worked and I can ignore this error ? The message: {noformat} Exception happened during processing of request from ('127.0.0.1', 62099) Traceback (most recent call last): File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock self.process_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in process_request self.finish_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in finish_request self.RequestHandlerClass(request, client_address, self) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in __init__ self.handle() File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, in handle _accumulatorRegistry[aid] += update KeyError: 0 2018-11-09 19:09:08 ERROR DAGScheduler:91 - Failed to update accumulators for task 0 org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1131) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1123) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1123) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1206) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25991) Update binary for 2.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-25991. - Resolution: Invalid > Update binary for 2.4.0 release > --- > > Key: SPARK-25991 > URL: https://issues.apache.org/jira/browse/SPARK-25991 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Vladimir Tsvetkov >Priority: Major > Attachments: image-2018-11-09-20-12-47-245.png > > > Archive with 2.4.0 release contains old binaries > https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25991) Update binary for 2.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25991: Attachment: image-2018-11-09-20-12-47-245.png > Update binary for 2.4.0 release > --- > > Key: SPARK-25991 > URL: https://issues.apache.org/jira/browse/SPARK-25991 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Vladimir Tsvetkov >Priority: Major > Attachments: image-2018-11-09-20-12-47-245.png > > > Archive with 2.4.0 release contains old binaries > https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681371#comment-16681371 ] Yuming Wang commented on SPARK-25991: - Please check your SPARK_HOME: !image-2018-11-09-20-12-47-245.png! > Update binary for 2.4.0 release > --- > > Key: SPARK-25991 > URL: https://issues.apache.org/jira/browse/SPARK-25991 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Vladimir Tsvetkov >Priority: Major > Attachments: image-2018-11-09-20-12-47-245.png > > > Archive with 2.4.0 release contains old binaries > https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681306#comment-16681306 ] Vladimir Tsvetkov commented on SPARK-25991: --- [~yumwang] sounds strange, but I ran spark-submit --version and saw version 2.3. Maybe I messed up my paths. Please close this issue. Thanks > Update binary for 2.4.0 release > --- > > Key: SPARK-25991 > URL: https://issues.apache.org/jira/browse/SPARK-25991 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Vladimir Tsvetkov >Priority: Major > > Archive with 2.4.0 release contains old binaries > https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681300#comment-16681300 ] Yuming Wang commented on SPARK-25991: - Sorry. I do not understand what you mean. > Update binary for 2.4.0 release > --- > > Key: SPARK-25991 > URL: https://issues.apache.org/jira/browse/SPARK-25991 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Vladimir Tsvetkov >Priority: Major > > Archive with 2.4.0 release contains old binaries > https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681297#comment-16681297 ] Steve Loughran commented on SPARK-25966: bq. Hadoop 3.1.x is not yet officially supported in Spark. true, but there were some changes in the S3A input stream there, so it is worth checking whether they caused this h3. better recovery of failures in the underlying read() call Before: [https://github.com/apache/hadoop/blob/branch-2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L382] After: [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L364] h3. AWS SDK++ an update to a more recent AWS SDK (1.11.271), which complains a lot more if you close an input stream while there's still data h3. Adaptive seek policy When you start off with fadvise=normal the first read is the full file, but if you do a backward seek it switches to random IO (fs.s3a.experimental.fadvise=random): [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L281] Unless fadvise=random is set (best) or fadvise=sequential (completely wrong for striped columnar formats), the parquet reader is following that codepath. [~andrioni]: can you put the log {{org.apache.hadoop.fs.s3a.S3AInputStream}} into DEBUG and see what it says on these failures? > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. 
>Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage
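For anyone following up on Steve's request: assuming a stock log4j.properties setup (the conventional Spark layout; adjust to whatever logging config the cluster uses), the DEBUG switch is a single logger entry:

```properties
# Verbose S3AInputStream diagnostics, as requested in the comment above
log4j.logger.org.apache.hadoop.fs.s3a.S3AInputStream=DEBUG
```

Correspondingly, the random-IO policy Steve mentions can be forced through Spark's Hadoop-conf passthrough, e.g. `spark.hadoop.fs.s3a.experimental.fadvise random` in spark-defaults.conf.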
[jira] [Created] (SPARK-25991) Update binary for 2.4.0 release
Vladimir Tsvetkov created SPARK-25991: - Summary: Update binary for 2.4.0 release Key: SPARK-25991 URL: https://issues.apache.org/jira/browse/SPARK-25991 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Vladimir Tsvetkov Archive with 2.4.0 release contains old binaries https://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681151#comment-16681151 ] Alan commented on SPARK-24421: -- If I understand correctly, the high-level need is -XX:MaxDirectMemorySize=unlimited but without specifying a command line option. Do you specify any other arguments? Maybe you could include an arg file with all options? As regards the hack, it looks like it involves the non-public constructor needed for JNI NewDirectByteBuffer and then patching the cleaner field. Ugh, that is way too fragile, as the JDK internals can change at any time; hacking into buffer fields will also break once java.base is fully encapsulated. > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22737) Simplity OneVsRest transform
[ https://issues.apache.org/jira/browse/SPARK-22737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-22737. -- Resolution: Not A Problem > Simplity OneVsRest transform > > > Key: SPARK-22737 > URL: https://issues.apache.org/jira/browse/SPARK-22737 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Major > > The current impl of OneVsRest#transform is over-complicated. It sequentially > updates an accumulated column. > By using a direct UDF of prediction, we obtain a speedup of at least 2x. > In one extreme case with 20 classes, it obtains about a 14x speedup. > The test code and performance comparison details are in the corresponding PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
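The "direct UDF of prediction" in the description boils down to a single argmax over the per-class raw scores, instead of k sequential updates to an accumulated column. An illustrative sketch of that core computation, not the actual PR code:

```java
import java.util.Arrays;

public class ArgMax {
    // OneVsRest prediction is the index of the largest per-class raw
    // score; a single pass like this is what a direct prediction UDF
    // computes, avoiding one column rewrite per class.
    static int predict(double[] rawScores) {
        int best = 0;
        for (int i = 1; i < rawScores.length; i++) {
            if (rawScores[i] > rawScores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.7, 0.2};
        System.out.println(Arrays.toString(scores) + " -> class " + predict(scores));
    }
}
```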
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681103#comment-16681103 ] Deej edited comment on SPARK-12216 at 11/9/18 9:08 AM: --- This issue has *NOT* been fixed, so marking it as Resolved is plain silly. Moreover, suggesting users to switch to other OSes is not only reckless but also regressive when there is a large community of users attempting to adopt Spark as one of their large scale data processing tools. So please stop with the condescension and work on fixing this bug as the community has been expecting for a long while now. As others have reported, I am able to successfully launch spark-shell and perform basic tasks (including sc.stop()) successfully. However, the moment I try to quit the repl session, it craps out immediately. Also, I am able to manually delete the said temp files/folders Spark creates in the temp directory so there are no permissions issues. Even executing these commands from a command prompt running as Administrator results in the same error, reinforcing the assumption that this is not related to permissions on the temp folder at all. 
Here is my set-up to reproduce this issue:- OS: Windows 10 Spark: version 2.3.2 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171) Stack trace: === scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded scala> sc.stop() scala> :quit 2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 java.io.IOException: Failed to delete: C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) was (Author: laal): This issue has *NOT* been fixed, so marking it as Resolved is plain silly. Moreover, suggesting users to switch to other OSes is not only reckless but also regressive when there is a large community of users attempting to adopt Spark as one of their large scale data processing tools. So please stop with the condescension and work on fixing this bug as the community has been expecting for a long while now. 
As others have reported, I am able to successfully launch spark-shell and perform basic tasks (including sc.stop()) successfully. However, the moment I try to quit the repl session, it craps out immediately. Here is my set-up to reproduce this issue:- OS: Windows 10 Spark: version 2.3.2 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171) Stack trace: === scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded scala> sc.stop() scala> :quit 2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 java.io.IOException: Failed to delete: C:\Users\{color:#33}user1{color}\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681103#comment-16681103 ] Deej commented on SPARK-12216: -- This issue has *NOT* been fixed, so marking it as Resolved is plain silly. Moreover, suggesting users to switch to other OSes is not only reckless but also regressive when there is a large community of users attempting to adopt Spark as one of their large scale data processing tools. So please stop with the condescension and work on fixing this bug as the community has been expecting for a long while now. As others have reported, I am able to successfully launch spark-shell and perform basic tasks (including sc.stop()) successfully. However, the moment I try to quit the repl session, it craps out immediately. Here is my set-up to reproduce this issue:- OS: Windows 10 Spark: version 2.3.2 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171) Stack trace: === scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded scala> sc.stop() scala> :quit 2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 java.io.IOException: Failed to delete: C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.5.2 > Java 1.8.0_65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. 
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at >
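The "Failed to delete" at shutdown typically means some process (often the JVM itself, on Windows) still holds a handle on a file under the Spark temp directory, so `Utils.deleteRecursively` throws from the shutdown hook. As a hedged illustration only (this is not Spark's actual code), a best-effort cleanup can record undeletable entries instead of failing the whole hook:

```python
import os
import shutil
import tempfile

def delete_recursively(path):
    """Best-effort recursive delete: collect paths that cannot be removed
    (e.g. files still locked by another process on Windows) instead of
    raising and aborting the rest of the shutdown sequence."""
    failed = []

    def on_error(func, p, exc_info):
        failed.append(p)  # record the locked/undeletable entry and keep going

    shutil.rmtree(path, onerror=on_error)
    return failed

# usage: create a throwaway temp dir, then clean it up
d = tempfile.mkdtemp(prefix="spark-")
open(os.path.join(d, "repl.tmp"), "w").close()
leftovers = delete_recursively(d)
```

With no open handles, `leftovers` is empty and the directory is gone; on a locked file it would list the offending path rather than raise.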
[jira] [Assigned] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24229: Assignee: (was: Apache Spark) > Upgrade to the latest Apache Thrift 0.10.0 release > -- > > Key: SPARK-24229 > URL: https://issues.apache.org/jira/browse/SPARK-24229 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ray Donnelly >Priority: Critical > > According to [https://www.cvedetails.com/cve/CVE-2016-5397/] > > .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored > in Apache Spark (and then, for us, into PySpark). > > Can anyone help to assess the seriousness of this and what should be done > about it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681099#comment-16681099 ] Apache Spark commented on SPARK-24229: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/22992 > Upgrade to the latest Apache Thrift 0.10.0 release > Key: SPARK-24229 > URL: https://issues.apache.org/jira/browse/SPARK-24229
[jira] [Assigned] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24229: Assignee: Apache Spark > Upgrade to the latest Apache Thrift 0.10.0 release > Key: SPARK-24229 > URL: https://issues.apache.org/jira/browse/SPARK-24229
[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25973: Assignee: (was: Apache Spark) > Spark History Main page performance improvement > --- > > Key: SPARK-25973 > URL: https://issues.apache.org/jira/browse/SPARK-25973 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2 >Reporter: William Montaz >Priority: Minor > Attachments: fix.patch > > > HistoryPage.scala counts applications (with a predicate depending on whether it is > displaying incomplete or complete applications) to check whether it must display > the dataTable. > Since it only checks that allAppsSize > 0, we could use the exists method on the > iterator. This way we stop iterating at the first occurrence found.
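The proposed change is a classic short-circuit: an exists-style check stops at the first matching element, while a full count always traverses the whole collection even though only "greater than zero" is needed. A small Python sketch of the difference (hypothetical predicate and records, not the HistoryPage code itself):

```python
def count_matching(apps, pred):
    """Full traversal: visits every element even if the first one matches."""
    return sum(1 for a in apps if pred(a))

def exists_matching(apps, pred):
    """Short-circuit: stops at the first element satisfying the predicate."""
    return any(pred(a) for a in apps)

calls = 0
def is_complete(app):
    global calls
    calls += 1  # count how many elements the predicate actually inspects
    return app["complete"]

apps = [{"complete": True} for _ in range(1000)]
exists_matching(apps, is_complete)
first_scan = calls               # any() stopped at the very first element
count_matching(apps, is_complete)
full_scan = calls - first_scan   # counting inspected all 1000 elements
```

On a history server with thousands of applications, this is the entire saving: the page only needs to know whether at least one application matches, not how many.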
[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25973: Assignee: Apache Spark > Spark History Main page performance improvement > Key: SPARK-25973 > URL: https://issues.apache.org/jira/browse/SPARK-25973
[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681054#comment-16681054 ] Apache Spark commented on SPARK-25973: -- User 'Willymontaz' has created a pull request for this issue: https://github.com/apache/spark/pull/22982 > Spark History Main page performance improvement > Key: SPARK-25973 > URL: https://issues.apache.org/jira/browse/SPARK-25973
[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25989: Assignee: (was: Apache Spark) > OneVsRestModel handle empty outputCols incorrectly > -- > > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > {{ml.classification.ClassificationModel}} will ignore empty output columns. > However, {{OneVsRestModel}} still tries to append a new column even if its name > is an empty string. > {code:java} > scala> ovrModel.setPredictionCol("").transform(test).show > +-+++---+ > |label| features| rawPrediction| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > scala> > ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show > +-+++---+ > |label| features| raw| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > {code}
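The fix implied by the report is to guard column creation on a non-empty name, mirroring how the other classification models ignore empty output columns. A rough Python sketch of that guard over dict-based rows (a hypothetical `transform` helper, not the actual OneVsRestModel code):

```python
def transform(rows, prediction_col, raw_prediction_col, predict):
    """Append output columns only when their names are non-empty,
    mirroring how ClassificationModel ignores empty output columns."""
    out = []
    for row in rows:
        new_row = dict(row)
        raw, label = predict(row["features"])
        if raw_prediction_col:              # skip when the name is ""
            new_row[raw_prediction_col] = raw
        if prediction_col:                  # skip when the name is ""
            new_row[prediction_col] = label
        out.append(new_row)
    return out

rows = [{"features": [0.1, 0.2]}]
# with an empty prediction column name, no nameless column is appended
result = transform(rows, "", "raw", lambda f: ([sum(f)], 0.0))
```

The bug shown in the `show` output above is exactly the absence of such a guard: a column whose header is the empty string still gets appended.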
[jira] [Updated] (SPARK-25852) we should filter the workOffers with freeCores>=CPUS_PER_TASK at first for better performance
[ https://issues.apache.org/jira/browse/SPARK-25852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-25852: Priority: Trivial (was: Major) > we should filter the workOffers with freeCores>=CPUS_PER_TASK at first for > better performance > - > > Key: SPARK-25852 > URL: https://issues.apache.org/jira/browse/SPARK-25852 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.3.2 >Reporter: zuotingbing >Priority: Trivial > Attachments: 2018-10-26_162822.png > > > We should filter the workOffers with freeCores>=CPUS_PER_TASK for better > performance.
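The idea behind the SPARK-25852 change is to discard offers that cannot run even a single task before the scheduler's per-task loop ever sees them. A small Python sketch of that pre-filter (hypothetical offer records; `CPUS_PER_TASK` is an assumed illustrative value):

```python
CPUS_PER_TASK = 2  # assumed value for illustration; configurable in Spark

def schedulable_offers(offers):
    """Drop offers lacking enough free cores for even one task, so the
    scheduling loop never iterates over offers it cannot use."""
    return [o for o in offers if o["freeCores"] >= CPUS_PER_TASK]

offers = [
    {"host": "a", "freeCores": 0},  # fully busy executor
    {"host": "b", "freeCores": 4},  # can host tasks
    {"host": "c", "freeCores": 1},  # below CPUS_PER_TASK, unusable
]
usable = schedulable_offers(offers)
```

Filtering once up front is cheaper than re-checking `freeCores >= CPUS_PER_TASK` inside every round of the task-placement loop, which is the performance point the issue makes.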
[jira] [Created] (SPARK-25990) TRANSFORM should handle different data types correctly
Wenchen Fan created SPARK-25990: --- Summary: TRANSFORM should handle different data types correctly Key: SPARK-25990 URL: https://issues.apache.org/jira/browse/SPARK-25990 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681020#comment-16681020 ] Apache Spark commented on SPARK-25989: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/22991 > OneVsRestModel handle empty outputCols incorrectly > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989
[jira] [Commented] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681018#comment-16681018 ] Apache Spark commented on SPARK-25989: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/22991 > OneVsRestModel handle empty outputCols incorrectly > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989
[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25989: Assignee: Apache Spark > OneVsRestModel handle empty outputCols incorrectly > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989
[jira] [Updated] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-25989: - Priority: Minor (was: Major) > OneVsRestModel handle empty outputCols incorrectly > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989
[jira] [Created] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
zhengruifeng created SPARK-25989: Summary: OneVsRestModel handle empty outputCols incorrectly Key: SPARK-25989 URL: https://issues.apache.org/jira/browse/SPARK-25989 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng {{ml.classification.ClassificationModel}} will ignore empty output columns. However, {{OneVsRestModel}} still tries to append a new column even if its name is an empty string.