[jira] [Commented] (SPARK-20543) R should skip long running or non-essential tests when running on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999262#comment-15999262 ]

Apache Spark commented on SPARK-20543:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17878

> R should skip long running or non-essential tests when running on CRAN
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20543
>                 URL: https://issues.apache.org/jira/browse/SPARK-20543
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Felix Cheung
>            Assignee: Felix Cheung
>             Fix For: 2.2.0, 2.3.0
>
>
> This is actually recommended in the CRAN policies

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-20614.
----------------------------------
          Resolution: Fixed
            Assignee: Hyukjin Kwon
       Fix Version/s: 2.3.0
    Target Version/s: 2.3.0

> Use the same log4j configuration with Jenkins in AppVeyor
> ---------------------------------------------------------
>
>                 Key: SPARK-20614
>                 URL: https://issues.apache.org/jira/browse/SPARK-20614
>             Project: Spark
>          Issue Type: Improvement
>          Components: Project Infra
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.3.0
>
>
> Currently, there are flooding logs in AppVeyor (in the console). This has
> been fine because we can download all the logs. However, (given my
> observations so far), logs are truncated when there are too many.
> For example, see
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
> Even after the log is downloaded, it looks truncated as below:
> {code}
> [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 (TID 9213)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 (TID 9212). 2473 bytes result sent to driver
> {code}
> Probably, it looks better to use the same log4j configuration that we are
> using for Jenkins.
[jira] [Commented] (SPARK-20520) R streaming tests failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999252#comment-15999252 ]

Felix Cheung commented on SPARK-20520:
--------------------------------------

waiting for the next RC to try with fix for SPARK-20571

> R streaming tests failed on Windows
> -----------------------------------
>
>                 Key: SPARK-20520
>                 URL: https://issues.apache.org/jira/browse/SPARK-20520
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Felix Cheung
>            Assignee: Felix Cheung
>            Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages
> {code}
> Failed -------------------------------------------------------------------------
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#56)
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#60)
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive (@test_streaming.R#75)
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) -------------------------
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) -------------------------
> any(...) isn't true.
> {code}
> Need to investigate
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null entries in col1 are considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > from pyspark.sql import functions as sf > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null entries in col1 are considered 'isin' the list ["a"] (it > is not in the list so it should show): > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ >
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > from pyspark.sql import functions as sf > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"] (it is not in the list > so it should show): > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| >
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > from pyspark.sql import functions as sf > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > from pyspark.sql import functions as sf > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > from pyspark.sql import functions as sf > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ed Lee updated SPARK-20617:
---------------------------
    Summary: pyspark.sql, filtering with ~isin missing rows  (was: pyspark.sql, ~isin when columns contain null (missing rows))

> pyspark.sql, filtering with ~isin missing rows
> ----------------------------------------------
>
>                 Key: SPARK-20617
>                 URL: https://issues.apache.org/jira/browse/SPARK-20617
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>         Environment: Ubuntu Xenial 16.04
>            Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
>                         "col2": range(5)
>                         })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
> |col1|col2|
> |null|   0|
> |null|   1|
> |   a|   2|
> |   b|   3|
> |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
> |col1|col2|
> |null|   0|
> |null|   1|
> |   b|   3|
> |   c|   4|
> *Got*:
> |col1|col2|
> |   b|   3|
> |   c|   4|
> My workarounds:
> 1. null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"]) == False) | (sf.col("col1").isNull())).show()
> To get:
> |col1|col2|isin|
> |null|   0|null|
> |null|   1|null|
> |   c|   4|null|
> |   b|   3|null|
> 2. Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
>                         "isin": 1
>                         })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
>             .filter(sf.col("isin").isNull()) \
>             .show()
> To get:
> |col1|col2|isin|
> |null|   0|null|
> |null|   1|null|
> |   c|   4|null|
> |   b|   3|null|
> Thank you
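
The result reported in SPARK-20617 can be reproduced without Spark by modelling SQL three-valued logic, in which a comparison against NULL is neither true nor false and a filter keeps only rows whose predicate is true. The sketch below is a plain-Python model under that assumption, not Spark's implementation; `sql_in`, `sql_not`, `sql_or` and `sql_filter` are hypothetical helper names, and `None` stands in for NULL.

```python
# Plain-Python model of SQL three-valued logic (True / False / None for NULL).
# sql_in, sql_not, sql_or and sql_filter are hypothetical helpers written for
# illustration only; they are not part of the PySpark API.

def sql_in(value, candidates):
    """Model of `value IN (candidates)`: a NULL input yields NULL (None)."""
    if value is None:
        return None
    if value in candidates:
        return True
    # A NULL among the candidates makes a non-match unknown rather than False.
    return None if any(c is None for c in candidates) else False


def sql_not(p):
    """NOT NULL is still NULL."""
    return None if p is None else not p


def sql_or(p, q):
    """OR is True if either side is True, NULL if the result is unknown."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def sql_filter(rows, predicate):
    """Keep only rows whose predicate is exactly True, as a WHERE clause does."""
    return [row for row in rows if predicate(row) is True]


rows = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]

# Model of: col1 NOT IN ('a')
not_in_a = sql_filter(rows, lambda r: sql_not(sql_in(r[0], ["a"])))

# Model of: col1 NOT IN ('a') OR col1 IS NULL
with_null_check = sql_filter(
    rows, lambda r: sql_or(sql_not(sql_in(r[0], ["a"])), r[0] is None)
)

print(not_in_a)         # [('b', 3), ('c', 4)]
print(with_null_check)  # [(None, 0), (None, 1), ('b', 3), ('c', 4)]
```

In this model the null rows are dropped because the negated predicate evaluates to NULL, not because null is treated as a member of the list; adding an explicit null check restores them, which matches the first workaround in the report.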
[jira] [Updated] (SPARK-20617) pyspark.sql, ~isin when columns contain null (missing rows)
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ed Lee updated SPARK-20617:
---------------------------
    Summary: pyspark.sql, ~isin when columns contain null (missing rows)  (was: pyspark.sql, isin when columns contain null)

> pyspark.sql, ~isin when columns contain null (missing rows)
> -----------------------------------------------------------
>
>                 Key: SPARK-20617
>                 URL: https://issues.apache.org/jira/browse/SPARK-20617
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>         Environment: Ubuntu Xenial 16.04
>            Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
>                         "col2": range(5)
>                         })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
> |col1|col2|
> |null|   0|
> |null|   1|
> |   a|   2|
> |   b|   3|
> |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
> |col1|col2|
> |null|   0|
> |null|   1|
> |   b|   3|
> |   c|   4|
> *Got*:
> |col1|col2|
> |   b|   3|
> |   c|   4|
> My workarounds:
> 1. null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"]) == False) | (sf.col("col1").isNull())).show()
> To get:
> |col1|col2|isin|
> |null|   0|null|
> |null|   1|null|
> |   c|   4|null|
> |   b|   3|null|
> 2. Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
>                         "isin": 1
>                         })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
>             .filter(sf.col("isin").isNull()) \
>             .show()
> To get:
> |col1|col2|isin|
> |null|   0|null|
> |null|   1|null|
> |   c|   4|null|
> |   b|   3|null|
> Thank you
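
The left-join workaround in the report keeps the null rows for a different reason: in an equality join a NULL key matches nothing, so those rows come back with NULL in the right-hand columns and pass the `isNull` filter. The following is a rough plain-Python model of that behaviour, assuming standard left outer join semantics; `left_join` is a hypothetical helper and not a PySpark API, and row order is not meant to mirror what Spark returns.

```python
# Plain-Python model of a left outer join followed by an IS NULL filter.
# left_join is a hypothetical helper, not a PySpark API; None stands for NULL.

def left_join(left_rows, right_rows):
    """Left outer join on the first field; NULL keys never compare equal."""
    joined = []
    for key, value in left_rows:
        matches = [
            flag
            for rkey, flag in right_rows
            if key is not None and rkey is not None and key == rkey
        ]
        if matches:
            joined.extend((key, value, flag) for flag in matches)
        else:
            # No match: the right-hand column is filled with NULL.
            joined.append((key, value, None))
    return joined


test_rows = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]
join_rows = [("a", 1)]  # (col1, isin)

# Keep the rows that found no partner in join_rows.
kept = [row for row in left_join(test_rows, join_rows) if row[2] is None]
print(kept)  # [(None, 0, None), (None, 1, None), ('b', 3, None), ('c', 4, None)]
```

Only the row with key "a" finds a partner, so it is the only one removed by the filter.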
[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. 
null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, isin when columns contain null > > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. 
Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() #Expecting |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| # Got: |col1|col2| | b| 3| | c| 4| # My workarounds: # 1. # null is considered 'in', so add OR isNull conditon! 
test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() # 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() |col1|col2|isin|\ |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, isin when columns contain null > > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > 2. 
Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() #Expecting |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| # Got: |col1|col2| | b| 3| | c| 4| # My workarounds: # 1. # null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() # 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() |col1|col2|isin|\ |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() +++ |col1|col2| +++ |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| +++ # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() #Expecting # +++ # |col1|col2| # +++ # |null| 0| # |null| 1| # | b| 3| # | c| 4| # +++ # Got: # +++ # |col1|col2| # +++ # | b| 3| # | c| 4| # +++ # My workarounds: # 1. 
# null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() # 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() # ++++ # |col1|col2|isin| # ++++ # |null| 0|null| # |null| 1|null| # | c| 4|null| # | b| 3|null| # ++++ > pyspark.sql, isin when columns contain null > > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > #Expecting > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > # Got: > |col1|col2| > | b| 3| > | c| 4| > # My workarounds: > # 1. > # null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > # 2. 
Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > |col1|col2|isin|\ > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() +++ |col1|col2| +++ |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| +++ # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() #Expecting # +++ # |col1|col2| # +++ # |null| 0| # |null| 1| # | b| 3| # | c| 4| # +++ # Got: # +++ # |col1|col2| # +++ # | b| 3| # | c| 4| # +++ # My workarounds: # 1. # null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() # 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() # ++++ # |col1|col2|isin| # ++++ # |null| 0|null| # |null| 1|null| # | c| 4|null| # | b| 3|null| # ++++ was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. 
Enclosed below an example to replicate: import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() # +++ # |col1|col2| # +++ # |null| 0| # |null| 1| # | a| 2| # | b| 3| # | c| 4| # +++ # Below shows null is considered 'isin' the list ["a"]: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() #Expecting # +++ # |col1|col2| # +++ # |null| 0| # |null| 1| # | b| 3| # | c| 4| # +++ # Got: # +++ # |col1|col2| # +++ # | b| 3| # | c| 4| # +++ # My workarounds: # 1. # null is considered 'in', so add OR isNull conditon! test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() # 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() # ++++ # |col1|col2|isin| # ++++ # |null| 0|null| # |null| 1|null| # | c| 4|null| # | b| 3|null| # ++++ > pyspark.sql, isin when columns contain null > > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. 
> Enclosed below an example to replicate: > import pandas as pd > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > +++ > |col1|col2| > +++ > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > +++ > # Below shows null is considered 'isin' the list ["a"]: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > #Expecting > # +++ > # |col1|col2| > # +++ > # |null| 0| > # |null| 1| > # | b| 3| > # | c| 4| > # +++ > # Got: > # +++ > # |col1|col2| > # +++ > # | b| 3| > # | c| 4| > # +++ > # My workarounds: > # 1. > # null is considered 'in', so add OR isNull conditon! > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > # 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > # ++++ > # |col1|col2|isin| > # ++++ > #
[jira] [Created] (SPARK-20617) pyspark.sql, isin when columns contain null
Ed Lee created SPARK-20617:

Summary: pyspark.sql, isin when columns contain null
Key: SPARK-20617
URL: https://issues.apache.org/jira/browse/SPARK-20617
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.2.0
Environment: Ubuntu Xenial 16.04
Reporter: Ed Lee

Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04.

Enclosed below is an example to replicate:

import pandas as pd
from pyspark.sql import functions as sf  # provides the sf alias used below

test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
                        "col2": range(5)
                        })
test_sdf = spark.createDataFrame(test_df)
test_sdf.show()
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   a|   2|
# |   b|   3|
# |   c|   4|
# +----+----+

# Below shows null is considered 'isin' the list ["a"]:
test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
# Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

# Expecting:
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   b|   3|
# |   c|   4|
# +----+----+

# Got:
# +----+----+
# |col1|col2|
# +----+----+
# |   b|   3|
# |   c|   4|
# +----+----+

# My workarounds:
# 1. null is considered 'in', so add OR isNull condition!
test_sdf.filter((sf.col("col1").isin(["a"]) == False) | (
    sf.col("col1").isNull())).show()

# 2. Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
                        "isin": 1
                        })
join_sdf = spark.createDataFrame(join_df)
test_sdf.join(join_sdf, on="col1", how="left") \
    .filter(sf.col("isin").isNull()) \
    .show()
# +----+----+----+
# |col1|col2|isin|
# +----+----+----+
# |null|   0|null|
# |null|   1|null|
# |   c|   4|null|
# |   b|   3|null|
# +----+----+----+
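One possible reading of the result reported above, offered as an assumption rather than a confirmed diagnosis: under standard SQL three-valued logic, NULL IN ('a') evaluates to NULL rather than TRUE or FALSE, its negation is also NULL, and a filter keeps only rows whose predicate is TRUE. The sketch below models that in plain Python, with no Spark required. The names sql_in, sql_not, sql_or and sql_filter are hypothetical helpers, and None stands in for NULL.

```python
# Plain-Python model of SQL three-valued logic; None stands in for NULL.
# All function names here are hypothetical helpers, not Spark APIs.

def sql_in(value, items):
    """value IN (items): NULL if value is NULL, or if nothing matches and items has a NULL."""
    if value is None:
        return None
    if any(item is not None and item == value for item in items):
        return True
    return None if any(item is None for item in items) else False

def sql_not(p):
    """NOT p: NULL stays NULL."""
    return None if p is None else (not p)

def sql_or(p, q):
    """p OR q: TRUE wins, otherwise NULL if either side is NULL."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def sql_filter(rows, predicate):
    """Keep only rows whose predicate is exactly True (NULL rows are dropped)."""
    return [row for row in rows if predicate(row) is True]

rows = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]

# NOT (col1 IN ('a')): the NULL rows evaluate to NULL and are filtered out.
not_in = sql_filter(rows, lambda r: sql_not(sql_in(r[0], ["a"])))

# Workaround 1 from the report: OR the predicate with an explicit NULL check.
with_nulls = sql_filter(
    rows, lambda r: sql_or(sql_not(sql_in(r[0], ["a"])), r[0] is None)
)

print(not_in)
print(with_nulls)
```

Under this model the first filter returns only the "b" and "c" rows, matching the "Got" table, and the second also returns the two null rows, matching the "Expecting" table.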
[jira] [Resolved] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch
[ https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20616. - Resolution: Fixed Assignee: Juliusz Sompolski Fix Version/s: 2.2.0 2.1.2 > RuleExecutor logDebug of batch results should show diff to start of batch > - > > Key: SPARK-20616 > URL: https://issues.apache.org/jira/browse/SPARK-20616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski > Fix For: 2.1.2, 2.2.0 > > > Due to a likely typo, the logDebug msg printing the diff of query plans shows > a diff to the initial plan, not diff to the start of batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
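The difference between "diff to the initial plan" and "diff to the start of batch" can be illustrated with a toy rule executor. This is a minimal sketch in plain Python, not Spark's Scala RuleExecutor. Plans are modelled as multi-line strings, rules as string-to-string functions, and the batch names "pushdown" and "pruning" are made up for the example.

```python
import difflib

def run_batches(plan, batches):
    """Apply each batch of rules in order and record a per-batch diff."""
    batch_logs = []
    for batch_name, rules in batches:
        batch_start_plan = plan  # snapshot taken at the start of THIS batch
        for rule in rules:
            plan = rule(plan)
        if plan != batch_start_plan:
            # Diff against the batch's own starting plan, not the initial plan,
            # so changes made by earlier batches do not reappear in this log.
            diff = "\n".join(difflib.unified_diff(
                batch_start_plan.splitlines(), plan.splitlines(), lineterm=""))
            batch_logs.append((batch_name, diff))
    return plan, batch_logs

# Hypothetical batches: each rule rewrites one operator name in the plan text.
batches = [
    ("pushdown", [lambda p: p.replace("Filter", "Filter(pushed)")]),
    ("pruning", [lambda p: p.replace("Project", "Project(pruned)")]),
]
final_plan, batch_logs = run_batches("Project\nFilter\nScan", batches)
for name, diff in batch_logs:
    print(f"batch {name}:\n{diff}\n")
```

If the diff were taken against the initial plan instead, the log for the second batch would also show the Filter change made by the first batch, which is the confusion the issue describes.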
[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999016#comment-15999016 ]

Abhishek Madav commented on SPARK-19532:

I am running into this issue in a code path, similar to hiveWriterContainer, that is trying to write to the HDFS location. I tried setting spark.speculation to false, but that does not seem to be the cause. Is there any workaround? The wait time makes the job run very slowly.

> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.1.0
> Reporter: StanZhai
> Priority: Critical
>
> When `spark.speculation` is set to true, the thread dump page of the Executor in the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> Off-heap memory keeps growing until the Executor exits with an OOM exception.
> This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).
> Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]?
> The version of Hadoop is 2.6.4.
[jira] [Assigned] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
[ https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20615:

Assignee: (was: Apache Spark)

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Jon McLean
> Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an IndexOutOfBoundsException when the vector size is greater than zero and no values are defined. The toString() representation of such a vector is "(10,[],[])". This is because the argmax function tries to get the value at indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have the same issue.
[jira] [Assigned] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
[ https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20615:

Assignee: Apache Spark

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Jon McLean
> Assignee: Apache Spark
> Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an IndexOutOfBoundsException when the vector size is greater than zero and no values are defined. The toString() representation of such a vector is "(10,[],[])". This is because the argmax function tries to get the value at indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have the same issue.
[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
[ https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998970#comment-15998970 ]

Apache Spark commented on SPARK-20615:

User 'jonmclean' has created a pull request for this issue:
https://github.com/apache/spark/pull/17877

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Jon McLean
> Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an IndexOutOfBoundsException when the vector size is greater than zero and no values are defined. The toString() representation of such a vector is "(10,[],[])". This is because the argmax function tries to get the value at indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have the same issue.
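The missing size check described in the issue can be sketched as follows. This is a plain-Python model under stated assumptions, not the Scala code from the pull request: the vector is given as (size, indices, values) with sorted, unique indices, the function name sparse_argmax is hypothetical, and returning -1 for a zero-size vector and 0 for a vector with no stored entries are conventions chosen for the sketch.

```python
# Hypothetical plain-Python model of argmax over a sparse vector given as
# (size, indices, values); indices are assumed to be sorted and unique.

def sparse_argmax(size, indices, values):
    """Return the index of the first maximum element, or -1 for a zero-size vector."""
    if size == 0:
        return -1
    if not indices:
        # Guard for the case in the report: no stored entries, so every
        # element is an implicit zero and index 0 is the first maximum.
        return 0

    # Position of the first maximum among the explicitly stored values.
    max_pos = max(range(len(values)), key=lambda pos: values[pos])
    max_idx, max_val = indices[max_pos], values[max_pos]

    if len(indices) == size or max_val > 0:
        return max_idx

    # Some elements are implicit zeros and no stored value exceeds zero:
    # locate the first index that is not explicitly stored.
    first_zero = next(
        i for i in range(size) if i >= len(indices) or indices[i] != i
    )
    return first_zero if max_val < 0 else min(max_idx, first_zero)

print(sparse_argmax(10, [], []))            # the "(10,[],[])" vector from the report
print(sparse_argmax(5, [1, 3], [2.0, 7.0]))
```

Without the `if not indices` guard, the lookup of the first stored index would fail on an empty list, which mirrors the out-of-bounds access at indexes(0) described above.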
[jira] [Updated] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-20132: --- Fix Version/s: 2.2.0 > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Assignee: Michael Patterson >Priority: Minor > Labels: documentation, newbie > Fix For: 2.2.0, 2.3.0 > > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998900#comment-15998900 ]

Rupesh Mane commented on SPARK-18105:

I'm facing this issue with Spark 2.1.0 but not with Spark 2.0.2. I'm using AWS EMR 5.2.0, which has Spark 2.0.2, and jobs run successfully. With everything else the same (code, files to process, settings, etc.), I run into this issue when I use EMR 5.5.0, which has Spark 2.1.0. The stack trace is slightly different (see below); it is similar to the one in https://github.com/lz4/lz4-java/issues/13, which was fixed in 2013. Comparing the LZO binary dependency, Spark 2.0.2 and Spark 2.1.0 both use LZ4 1.3.0, so I'm confused about why it works on the older version of Spark. The only difference I see in the directory structure is that Spark 2.0.2 has the LZ4 libraries in lib but not under the python/lib folder, while Spark 2.1.0 has them in both lib and python/lib.

2017-05-05 01:15:50,681 [ERROR ] schema: Exception raised during Operation: An error occurred while calling o104.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, ip-172-31-26-105.ec2.internal, executor 1): java.io.IOException: Stream is corrupted
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:163)
    at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2606)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2622)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3099)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
    at
[jira] [Updated] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-19910:

Affects Version/s: 2.1.1

> `stack` should not reject NULL values due to type mismatch
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
> Reporter: Dongjoon Hyun
>
> Since `stack` function generates a table with nullable columns, it should allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}
[jira] [Commented] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998771#comment-15998771 ]

Dongjoon Hyun commented on SPARK-19910:

Hi, [~cloud_fan] and [~smilegator].
Could you review this issue and PR?

> `stack` should not reject NULL values due to type mismatch
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.1.0
> Reporter: Dongjoon Hyun
>
> Since `stack` function generates a table with nullable columns, it should allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}
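The behaviour requested in the issue, a per-column type check that lets NULL match any type, can be modelled roughly as below. This is a plain-Python sketch under assumptions, not Spark's analyzer: stack_column_types is a hypothetical helper, Python type names stand in for Spark data types, and None stands in for NULL.

```python
import math

def stack_column_types(num_rows, values):
    """Infer one type name per output column, letting None match any type."""
    num_cols = math.ceil(len(values) / num_rows)
    column_types = []
    for col in range(num_cols):
        # Values are laid out row by row, so column `col` holds every
        # num_cols-th value starting at position `col`.
        seen = {type(v).__name__ for v in values[col::num_cols] if v is not None}
        if len(seen) > 1:
            raise TypeError(f"column {col} mixes types: {sorted(seen)}")
        column_types.append(seen.pop() if seen else "null")
    return column_types

print(stack_column_types(3, [1, 2, 3]))
print(stack_column_types(3, [1, 2, None]))
```

In this model both calls report a single integer column, whereas the second query in the quoted session is rejected because NullType is compared strictly against IntegerType.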
[jira] [Commented] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy
[ https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998731#comment-15998731 ]

Jeeyoung Kim commented on SPARK-10878:

[~joshrosen] Yes, I see what the potential race conditions are (both inside Ivy and in how Spark uses Ivy).
Regarding (1), even if Ivy becomes thread-safe, writing a temporary pom file with a fixed filename would break things, so I think this is a valuable thing to do. I can attempt a patch for this.
Regarding (2), I think having multiple resolution caches to get around this is quite an inefficient solution. My cache directory is half a gigabyte right now, and having that per Spark job seems wasteful.

> Race condition when resolving Maven coordinates via Ivy
>
> Key: SPARK-10878
> URL: https://issues.apache.org/jira/browse/SPARK-10878
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.0
> Reporter: Ryan Williams
> Priority: Minor
>
> I've recently been shell-scripting the creation of many concurrent Spark-on-YARN apps and observing a fraction of them to fail with what I'm guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
> confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: failed to parse report: /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: Premature end of file.
>     at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
>     at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
>     at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
>     at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
>     at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: Premature end of file.
>     at org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
>     at org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
>     at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
>     ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them seem to fail with this or similar errors; a batch of ~10 will usually work fine, a batch of 15 will see a few failures, and a batch of ~60 will have dozens of failures.
> [This gist shows 11 recent failures I observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].
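The point about a temporary file with a fixed filename can be illustrated with a short sketch. It is written in Python using the standard tempfile module and is not the Scala code in SparkSubmit. The helper name write_unique_descriptor, the prefix, and the XML content are placeholders chosen for the example. The idea is that each process writes to a file whose name is unique to that call, so concurrent launches cannot truncate or half-read each other's file.

```python
import os
import tempfile

def write_unique_descriptor(content, directory, prefix="spark-submit-parent-"):
    """Write content to a freshly created, uniquely named file and return its path."""
    # mkstemp creates the file atomically with a name no other caller receives.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".xml", dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    return path

workdir = tempfile.mkdtemp()
first = write_unique_descriptor("<ivy-module/>", workdir)
second = write_unique_descriptor("<ivy-module/>", workdir)
print(first != second)
```

This only addresses the fixed-filename collision; it does not make a shared resolution cache safe for concurrent use, which is the separate concern raised in point (2).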
[jira] [Updated] (SPARK-20603) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial offset with Spark 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-20603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20603: - Affects Version/s: 2.1.1 2.1.0 > Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial > offset with Spark 2.1.0 > -- > > Key: SPARK-20603 > URL: https://issues.apache.org/jira/browse/SPARK-20603 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > > This test is flaky. This is the recent failure: > https://spark-tests.appspot.com/builds/spark-branch-2.2-test-maven-hadoop-2.7/47 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20603) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial offset with Spark 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-20603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20603. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial > offset with Spark 2.1.0 > -- > > Key: SPARK-20603 > URL: https://issues.apache.org/jira/browse/SPARK-20603 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > > This test is flaky. This is the recent failure: > https://spark-tests.appspot.com/builds/spark-branch-2.2-test-maven-hadoop-2.7/47 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20569:
------------------------------------

    Assignee:     (was: Apache Spark)

> RuntimeReplaceable functions accept invalid third parameter
> -----------------------------------------------------------
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: liuxian
> Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function takes only two input parameters, so when it is given three parameters, I think it should report: "Error in query: Invalid number of arguments for function nvl".
> Other functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.
[jira] [Assigned] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20569:
------------------------------------

    Assignee: Apache Spark

> RuntimeReplaceable functions accept invalid third parameter
> -----------------------------------------------------------
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: liuxian
> Assignee: Apache Spark
> Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function takes only two input parameters, so when it is given three parameters, I think it should report: "Error in query: Invalid number of arguments for function nvl".
> Other functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.
[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998690#comment-15998690 ]

Apache Spark commented on SPARK-20569:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17876

> RuntimeReplaceable functions accept invalid third parameter
> -----------------------------------------------------------
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: liuxian
> Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function takes only two input parameters, so when it is given three parameters, I think it should report: "Error in query: Invalid number of arguments for function nvl".
> Other functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.
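The behaviour the reporter asks for amounts to an argument-count check before the function is evaluated. A minimal Python sketch of that idea follows. It is not the code from the pull request: the table `EXPECTED_ARITY` and the helper `check_arity` are hypothetical names, and only the error text is taken from the report.

```python
# Hypothetical arity table for the functions named in the report.
EXPECTED_ARITY = {"nvl": 2, "nvl2": 3, "nullif": 2, "ifnull": 2}


def check_arity(name, args):
    """Reject calls whose argument count differs from the expected one."""
    key = name.lower()  # function names are matched case-insensitively here
    expected = EXPECTED_ARITY[key]
    if len(args) != expected:
        raise ValueError(
            "Invalid number of arguments for function %s" % key
        )
    return args
```

Under this check, `check_arity("Nvl", [None, "1", 3])` raises an error instead of silently using the third argument.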
[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests
[ https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998680#comment-15998680 ] Burak Yavuz commented on SPARK-20571: - Thanks! > Flaky SparkR StructuredStreaming tests > -- > > Key: SPARK-20571 > URL: https://issues.apache.org/jira/browse/SPARK-20571 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Felix Cheung > Fix For: 2.2.0, 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang
[ https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998654#comment-15998654 ] Shixiong Zhu commented on SPARK-18971: -- [~tgraves] No, as far as I known. But since Spark 2.2.0 has not yet been released, not sure how many people tested master or branch-2.2. > Netty issue may cause the shuffle client hang > - > > Key: SPARK-18971 > URL: https://issues.apache.org/jira/browse/SPARK-18971 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.2.0 > > > Check https://github.com/netty/netty/issues/6153 for details > You should be able to see the following similar stack track in the executor > thread dump. > {code} > "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE > at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504) > at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454) > at io.netty.util.Recycler$Stack.pop(Recycler.java:435) > at io.netty.util.Recycler.get(Recycler.java:144) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39) > at > io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:140) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18971) Netty issue may cause the shuffle client hang
[ https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998654#comment-15998654 ] Shixiong Zhu edited comment on SPARK-18971 at 5/5/17 5:49 PM: -- [~tgraves] No, as far as I know. But since Spark 2.2.0 has not yet been released, not sure how many people tested master or branch-2.2. was (Author: zsxwing): [~tgraves] No, as far as I known. But since Spark 2.2.0 has not yet been released, not sure how many people tested master or branch-2.2. > Netty issue may cause the shuffle client hang > - > > Key: SPARK-18971 > URL: https://issues.apache.org/jira/browse/SPARK-18971 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.2.0 > > > Check https://github.com/netty/netty/issues/6153 for details > You should be able to see the following similar stack track in the executor > thread dump. > {code} > "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE > at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504) > at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454) > at io.netty.util.Recycler$Stack.pop(Recycler.java:435) > at io.netty.util.Recycler.get(Recycler.java:144) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39) > at > io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:140) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch
[ https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20616: Assignee: Apache Spark > RuleExecutor logDebug of batch results should show diff to start of batch > - > > Key: SPARK-20616 > URL: https://issues.apache.org/jira/browse/SPARK-20616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Apache Spark > > Due to a likely typo, the logDebug msg printing the diff of query plans shows > a diff to the initial plan, not diff to the start of batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch
[ https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20616: Assignee: (was: Apache Spark) > RuleExecutor logDebug of batch results should show diff to start of batch > - > > Key: SPARK-20616 > URL: https://issues.apache.org/jira/browse/SPARK-20616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski > > Due to a likely typo, the logDebug msg printing the diff of query plans shows > a diff to the initial plan, not diff to the start of batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch
[ https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998639#comment-15998639 ] Apache Spark commented on SPARK-20616: -- User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/17875 > RuleExecutor logDebug of batch results should show diff to start of batch > - > > Key: SPARK-20616 > URL: https://issues.apache.org/jira/browse/SPARK-20616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski > > Due to a likely typo, the logDebug msg printing the diff of query plans shows > a diff to the initial plan, not diff to the start of batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch
Juliusz Sompolski created SPARK-20616: - Summary: RuleExecutor logDebug of batch results should show diff to start of batch Key: SPARK-20616 URL: https://issues.apache.org/jira/browse/SPARK-20616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Juliusz Sompolski Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
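The difference between "diff to the initial plan" and "diff to the start of batch" can be shown with a toy executor. This is only a sketch with assumptions: plans are plain strings, rules are ordinary functions, and `run_batches` is a hypothetical name, not Spark's RuleExecutor.

```python
def run_batches(plan, batches):
    """Apply batches of rewrite rules and log what each batch changed.

    Returns (final_plan, log). Each log entry pairs the plan at the
    start of that batch with the plan after it, not the initial plan.
    """
    log = []
    for name, rules in batches:
        batch_start = plan  # snapshot taken once per batch
        for rule in rules:
            plan = rule(plan)
        if plan != batch_start:
            log.append((name, batch_start, plan))
    return plan, log
```

If the log compared against the initial plan instead of `batch_start`, the entry for a later batch would also include every change made by earlier batches, which is the behaviour the issue describes.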
[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hua Liu updated SPARK-20564:
----------------------------

    Priority: Minor  (was: Major)

> a lot of executor failures when the executor number is more than 2000
> ---------------------------------------------------------------------
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 1.6.2, 2.1.0
> Reporter: Hua Liu
> Priority: Minor
>
> When we used more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and, as a result, were marked as failed. In some cases, the number of failed executors reached twice the requested executor count, so applications retried and could eventually fail.
> This is because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve Spark props and register with the driver within seconds. However, the driver handles executor registration, stop, removal and Spark props retrieval in one thread, and it cannot handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve Spark props and/or register. These executors are then marked as failed, causing executor removal and aggravating the overloading of the driver, which leads to more executor failures.
> This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the maximum number of containers that the driver can ask for in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily. The number of executor failures is reduced and applications can reach the desired number of executors faster.
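The effect of the proposed cap can be sketched with a small simulation. The code below is a simplified model, not YarnAllocator: it assumes every requested container is granted within the same heartbeat, and the function names are hypothetical.

```python
def containers_to_request(missing, cap):
    """Number of containers to ask for in a single heartbeat."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return min(missing, cap)


def ramp_up(target, cap):
    """Simulate heartbeats until the target executor count is reached.

    Assumes each request is fully granted. Returns the list of request
    sizes, one entry per heartbeat.
    """
    requests = []
    running = 0
    while running < target:
        granted = containers_to_request(target - running, cap)
        requests.append(granted)
        running += granted
    return requests
```

In this model, `ramp_up(2500, 1000)` spreads the launch over three heartbeats instead of asking for all 2500 containers at once, so the driver sees smaller bursts of registration RPCs.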
[jira] [Resolved] (SPARK-20381) ObjectHashAggregateExec is missing numOutputRows
[ https://issues.apache.org/jira/browse/SPARK-20381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20381. - Resolution: Fixed Assignee: yucai Fix Version/s: 2.2.0 > ObjectHashAggregateExec is missing numOutputRows > > > Key: SPARK-20381 > URL: https://issues.apache.org/jira/browse/SPARK-20381 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai >Assignee: yucai > Fix For: 2.2.0 > > > Add SQL metrics of numOutputRows for ObjectHashAggregateExec. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998544#comment-15998544 ]

Marcelo Vanzin commented on SPARK-20608:
----------------------------------------

Doesn't it work if you add the namespace (not the NN addresses) in the config instead? e.g. {{hdfs://somenamespace}} instead of explicitly calling out the active and standby addresses. (That requires hdfs-site.xml to contain the namespace to namenode mappings, but that's generally how HA works anyway.)

The problem I see with the patch is that the fact that you're catching {{StandbyException}} probably means a token is not being generated for the standby. So when it actually becomes active, things will fail because Spark doesn't have the right token to talk to it.

> Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit, YARN
> Affects Versions: 2.0.1, 2.1.0
> Reporter: Yuechen Chen
> Priority: Minor
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, yarn.spark.access.namenodes only needs to be configured in the spark-submit script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby namenodes, the Spark application fails because Spark cannot access the standby namenode, which throws org.apache.hadoop.ipc.StandbyException.
> I think configuring standby namenodes in yarn.spark.access.namenodes would have no bad effect, and it would let my Spark application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang
[ https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998531#comment-15998531 ] Thomas Graves commented on SPARK-18971: --- [~zsxwing]have you seen any issues with the new netty version? We have hit a similar issue? > Netty issue may cause the shuffle client hang > - > > Key: SPARK-18971 > URL: https://issues.apache.org/jira/browse/SPARK-18971 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.2.0 > > > Check https://github.com/netty/netty/issues/6153 for details > You should be able to see the following similar stack track in the executor > thread dump. > {code} > "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE > at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504) > at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454) > at io.netty.util.Recycler$Stack.pop(Recycler.java:435) > at io.netty.util.Recycler.get(Recycler.java:144) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39) > at > io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:140) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18971) Netty issue may cause the shuffle client hang
[ https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998531#comment-15998531 ] Thomas Graves edited comment on SPARK-18971 at 5/5/17 4:31 PM: --- [~zsxwing]have you seen any issues with the new netty version? We have hit this same issue. was (Author: tgraves): [~zsxwing]have you seen any issues with the new netty version? We have hit a similar issue? > Netty issue may cause the shuffle client hang > - > > Key: SPARK-18971 > URL: https://issues.apache.org/jira/browse/SPARK-18971 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.2.0 > > > Check https://github.com/netty/netty/issues/6153 for details > You should be able to see the following similar stack track in the executor > thread dump. > {code} > "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE > at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504) > at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454) > at io.netty.util.Recycler$Stack.pop(Recycler.java:435) > at io.netty.util.Recycler.get(Recycler.java:144) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39) > at > io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:140) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman reassigned SPARK-20613:
---------------------------------------------

    Assignee: Jarrett Meyer

> Double quotes in Windows batch script
> -------------------------------------
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
> Issue Type: Bug
> Components: Windows
> Affects Versions: 2.1.1
> Reporter: Jarrett Meyer
> Assignee: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>        ab          c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax.
[jira] [Commented] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998474#comment-15998474 ]

Felix Cheung commented on SPARK-20613:
--------------------------------------

[~shivaram] could you add jarretmeyer to the contributor list in JIRA so I could resolve this bug to him?

> Double quotes in Windows batch script
> -------------------------------------
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
> Issue Type: Bug
> Components: Windows
> Affects Versions: 2.1.1
> Reporter: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>        ab          c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax.
[jira] [Resolved] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-20613.
----------------------------------

          Resolution: Fixed
       Fix Version/s: 2.3.0
                      2.2.0
                      2.1.2
    Target Version/s: 2.1.2, 2.2.0, 2.3.0

> Double quotes in Windows batch script
> -------------------------------------
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
> Issue Type: Bug
> Components: Windows
> Affects Versions: 2.1.1
> Reporter: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>        ab          c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax.
[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998472#comment-15998472 ]

Wenchen Fan commented on SPARK-20569:
-------------------------------------

Yes, this is a bug. I'm working on a fix.

> RuntimeReplaceable functions accept invalid third parameter
> -----------------------------------------------------------
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: liuxian
> Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function takes only two input parameters, so when it is given three parameters, I think it should report: "Error in query: Invalid number of arguments for function nvl".
> Other functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.
[jira] [Commented] (SPARK-20581) Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE
[ https://issues.apache.org/jira/browse/SPARK-20581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998465#comment-15998465 ]

Wenchen Fan commented on SPARK-20581:
-------------------------------------

[~smilegator] do you remember which PR fixed it? We can consider backporting it.

> Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-20581
> URL: https://issues.apache.org/jira/browse/SPARK-20581
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: Dominic Ricard
>
> We stumbled on this multiple times and every time we are baffled by the behavior of AVG and SUM.
> Given the following SQL (Executed through Thrift):
> {noformat}
> SELECT SUM(col/2) FROM
> (SELECT 3 as `col`) t
> {noformat}
> The result will be "1", when the expected and accurate result is 1.5
> Here's the explain plan:
> {noformat}
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 2.0) as bigint)),mode=Final,isDistinct=false)], output=[_c0#1519344L])
> +- TungstenExchange SinglePartition, None
>    +- TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 2.0) as bigint)),mode=Partial,isDistinct=false)], output=[sum#1519347L])
>       +- Project [3 AS col#1519342]
>          +- Scan OneRowRelation[]
> {noformat}
> Why the extra cast to BIGINT?
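The arithmetic behind the reported result can be reproduced by hand. The Python sketch below only mirrors the casts visible in the explain plan; the function names are hypothetical, and `int()` stands in for the cast to bigint on the assumption that the cast truncates toward zero.

```python
def sum_with_bigint_cast(values, divisor):
    """Divide as double, cast each quotient to an integer, then sum."""
    return sum(int(v / divisor) for v in values)


def sum_as_double(values, divisor):
    """Divide as double and sum without the extra integer cast."""
    return sum(v / divisor for v in values)
```

For the single row in the report (`col = 3`, divided by 2), the first function gives 1 and the second gives 1.5, matching the observed and expected results respectively.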
[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
[ https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998460#comment-15998460 ] Jon McLean commented on SPARK-20615: Thank you. I will submit a patch with tests. > SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector > has a size greater than zero but no elements defined. > - > > Key: SPARK-20615 > URL: https://issues.apache.org/jira/browse/SPARK-20615 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Jon McLean >Priority: Minor > > org.apache.spark.ml.linalg.SparseVector.argmax throws an > IndexOutOfBoundsException when the vector size is greater than zero and no > values are defined. The toString() representation of such a vector is > "(10,[],[])". This is because the argmax function tries to get the value > at indexes(0) without checking the size of the array. > Code inspection reveals that the mllib version of SparseVector should have > the same issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
[ https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998459#comment-15998459 ] Sean Owen commented on SPARK-20615: --- Agree, I think you just want to return 0 if numActives == 0 early in the method. > SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector > has a size greater than zero but no elements defined. > - > > Key: SPARK-20615 > URL: https://issues.apache.org/jira/browse/SPARK-20615 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Jon McLean >Priority: Minor > > org.apache.spark.ml.linalg.SparseVector.argmax throws an > IndexOutOfBoundsException when the vector size is greater than zero and no > values are defined. The toString() representation of such a vector is > "(10,[],[])". This is because the argmax function tries to get the value > at indexes(0) without checking the size of the array. > Code inspection reveals that the mllib version of SparseVector should have > the same issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
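The early return suggested in this comment can be sketched as follows. This is a minimal standalone illustration in Python, not the Spark implementation: the function names and the (size, indices, values) argument layout are assumptions chosen to mirror the "(10,[],[])" representation, and the indices are assumed to be sorted in increasing order.

```python
# Hypothetical sketch of argmax over a sparse vector given as
# (size, sorted active indices, active values). Not Spark's code.

def first_inactive_index(indices):
    """Return the smallest index that has no explicitly stored value."""
    for pos, idx in enumerate(indices):
        if idx != pos:
            return pos
    return len(indices)


def sparse_argmax(size, indices, values):
    """Index of the first maximum element, treating missing entries as 0."""
    if size == 0:
        return -1  # assumption: an empty vector has no argmax
    if len(indices) == 0:
        return 0  # the guard from the comment: every element is an implicit zero

    # Find the first maximum among the explicitly stored values.
    best = 0
    for k in range(1, len(values)):
        if values[k] > values[best]:
            best = k
    max_value, max_index = values[best], indices[best]

    # An implicit zero can tie with or beat a non-positive stored maximum.
    if max_value <= 0 and len(indices) < size:
        zero_index = first_inactive_index(indices)
        if max_value < 0:
            return zero_index
        return min(zero_index, max_index)
    return max_index


print(sparse_argmax(10, [], []))
```

Without the `len(indices) == 0` guard, the lookup `values[best]` would fail on the empty list, which is the Python analogue of the out-of-bounds access described in the issue.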
[jira] [Created] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.
Jon McLean created SPARK-20615: -- Summary: SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined. Key: SPARK-20615 URL: https://issues.apache.org/jira/browse/SPARK-20615 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Jon McLean Priority: Minor org.apache.spark.ml.linalg.SparseVector.argmax throws an IndexOutOfBoundsException when the vector size is greater than zero and no values are defined. The toString() representation of such a vector is "(10,[],[])". This is because the argmax function tries to get the value at indexes(0) without checking the size of the array. Code inspection reveals that the mllib version of SparseVector should have the same issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20495) Add StorageLevel to cacheTable API
[ https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998442#comment-15998442 ] Wenchen Fan commented on SPARK-20495: - we usually don't backport new API changes, but this one is very small and might be ok, cc [~redlighter] > Add StorageLevel to cacheTable API > --- > > Key: SPARK-20495 > URL: https://issues.apache.org/jira/browse/SPARK-20495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > Fix For: 2.3.0 > > > Currently, cacheTable API always uses the default MEMORY_AND_DISK storage > level. We can add a new cacheTable API with the extra parameter StorageLevel. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20495) Add StorageLevel to cacheTable API
[ https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998442#comment-15998442 ] Wenchen Fan edited comment on SPARK-20495 at 5/5/17 3:00 PM: - we usually don't backport new API changes, but this one is very small and might be ok, cc [~smilegator] was (Author: cloud_fan): we usually don't backport new API changes, but this one is very small and might be ok, cc [~redlighter] > Add StorageLevel to cacheTable API > --- > > Key: SPARK-20495 > URL: https://issues.apache.org/jira/browse/SPARK-20495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > Fix For: 2.3.0 > > > Currently, cacheTable API always uses the default MEMORY_AND_DISK storage > level. We can add a new cacheTable API with the extra parameter StorageLevel. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20495) Add StorageLevel to cacheTable API
[ https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998435#comment-15998435 ] PJ Fanning commented on SPARK-20495: Thanks everyone for working on this change. Is it too late to consider this for v2.2.0 or even v2.2.1? > Add StorageLevel to cacheTable API > --- > > Key: SPARK-20495 > URL: https://issues.apache.org/jira/browse/SPARK-20495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > Fix For: 2.3.0 > > > Currently, cacheTable API always uses the default MEMORY_AND_DISK storage > level. We can add a new cacheTable API with the extra parameter StorageLevel. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20495) Add StorageLevel to cacheTable API
[ https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20495. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17802 [https://github.com/apache/spark/pull/17802] > Add StorageLevel to cacheTable API > --- > > Key: SPARK-20495 > URL: https://issues.apache.org/jira/browse/SPARK-20495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > Fix For: 2.3.0 > > > Currently, cacheTable API always uses the default MEMORY_AND_DISK storage > level. We can add a new cacheTable API with the extra parameter StorageLevel. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception
[ https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998402#comment-15998402 ] Apache Spark commented on SPARK-20612: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17874 > Unresolvable attribute in Filter won't throw analysis exception > --- > > Key: SPARK-20612 > URL: https://issues.apache.org/jira/browse/SPARK-20612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > We have a rule in Analyzer that adds missing attributes in a Filter into its > child plan. It makes the following code work: > {code} > val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y") > df.select("y").where("x=1") > {code} > It should throw an analysis exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20614: Assignee: (was: Apache Spark) > Use the same log4j configuration with Jenkins in AppVeyor > - > > Key: SPARK-20614 > URL: https://issues.apache.org/jira/browse/SPARK-20614 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, there are flooding logs in AppVeyor (in the console). This has > been fine because we can download all the logs. However, (given my > observations so far), logs are truncated when there are too many. > For example, see > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master > Even after the log is downloaded, it looks truncated as below: > {code} > [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in > stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 > (TID 9213) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage > 601.0 (TID 9212). 2473 bytes result sent to driver > {code} > Probably, it looks better to use the same log4j configuration that we are > using for Jenkins. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20614: Assignee: Apache Spark > Use the same log4j configuration with Jenkins in AppVeyor > - > > Key: SPARK-20614 > URL: https://issues.apache.org/jira/browse/SPARK-20614 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > Currently, there are flooding logs in AppVeyor (in the console). This has > been fine because we can download all the logs. However, (given my > observations so far), logs are truncated when there are too many. > For example, see > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master > Even after the log is downloaded, it looks truncated as below: > {code} > [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in > stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 > (TID 9213) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage > 601.0 (TID 9212). 2473 bytes result sent to driver > {code} > Probably, it looks better to use the same log4j configuration that we are > using for Jenkins. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998359#comment-15998359 ] Apache Spark commented on SPARK-20614: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17873 > Use the same log4j configuration with Jenkins in AppVeyor > - > > Key: SPARK-20614 > URL: https://issues.apache.org/jira/browse/SPARK-20614 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, there are flooding logs in AppVeyor (in the console). This has > been fine because we can download all the logs. However, (given my > observations so far), logs are truncated when there are too many. > For example, see > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master > Even after the log is downloaded, it looks truncated as below: > {code} > [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in > stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 > (TID 9213) > [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage > 601.0 (TID 9212). 2473 bytes result sent to driver > {code} > Probably, it looks better to use the same log4j configuration that we are > using for Jenkins. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor
Hyukjin Kwon created SPARK-20614: Summary: Use the same log4j configuration with Jenkins in AppVeyor Key: SPARK-20614 URL: https://issues.apache.org/jira/browse/SPARK-20614 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Currently, there are flooding logs in AppVeyor (in the console). This has been fine because we can download all the logs. However, (given my observations so far), logs are truncated when there are too many. For example, see https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master Even after the log is downloaded, it looks truncated as below: {code} [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200) [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 (TID 9213) [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 (TID 9212). 2473 bytes result sent to driver {code} Probably, it looks better to use the same log4j configuration that we are using for Jenkins. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)
[ https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998228#comment-15998228 ] Rick Moritz commented on SPARK-20489: - If someone could try and replicate my observations, I think that would be a great bit of help - the above code should run as-is. > Different results in local mode and yarn mode when working with dates (race > condition with SimpleDateFormat?) > - > > Key: SPARK-20489 > URL: https://issues.apache.org/jira/browse/SPARK-20489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: yarn-client mode in Zeppelin, Cloudera > Spark2-distribution >Reporter: Rick Moritz >Priority: Critical > > Running the following code (in Zeppelin, or spark-shell), I get different > results, depending on whether I am using local[*] mode or yarn-client mode: > {code:title=test case|borderStyle=solid} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > import spark.implicits._ > val counter = 1 to 2 > val size = 1 to 3 > val sampleText = spark.createDataFrame( > sc.parallelize(size) > .map(Row(_)), > StructType(Array(StructField("id", IntegerType, nullable=false)) > ) > ) > .withColumn("loadDTS",lit("2017-04-25T10:45:02.2")) > > val rddList = counter.map( > count => sampleText > .withColumn("loadDTS2", > date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS")) > .drop(col("loadDTS")) > .withColumnRenamed("loadDTS2","loadDTS") > .coalesce(4) > .rdd > ) > val resultText = spark.createDataFrame( > spark.sparkContext.union(rddList), > sampleText.schema > ) > val testGrouped = resultText.groupBy("id") > val timestamps = testGrouped.agg( > max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as > "timestamp" > ) > val loadDateResult = resultText.join(timestamps, "id") > val filteredresult = loadDateResult.filter($"timestamp" === > unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) > filteredresult.count > {code} > The expected result, *3*, is what I obtain in local mode, but as soon as I run > fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get > some results (depending on the size of counter) - none of which make any > sense. > Up to the application of the last filter, at first glance everything looks > okay, but then something goes wrong. Potentially this is due to lingering > re-use of SimpleDateFormats, but I can't get it to happen in a > non-distributed mode. The generated execution plan is the same in each case, > as expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998197#comment-15998197 ] Yuechen Chen commented on SPARK-20608: -- [~ste...@apache.org] Your concern is reasonable. In our tests, two exceptions are possible when yarn.spark.access.namenodes=hdfs://activeNamenode,hdfs://standbyNamenode 1) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby 2) Caused by: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby Perhaps RemoteException should be handled in a better way. > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If a Spark application needs to access remote namenodes, > yarn.spark.access.namenodes only needs to be configured in the spark-submit > script, and the Spark client (on YARN) fetches HDFS credentials periodically. > If a Hadoop cluster is configured for HA, there is one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark application fails because accessing the standby > namenode raises org.apache.hadoop.ipc.StandbyException. > I think configuring standby namenodes in yarn.spark.access.namenodes would > have no bad effect, and my Spark application would be able to survive > a Hadoop namenode failover. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark application code: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998189#comment-15998189 ] Apache Spark commented on SPARK-20608: -- User 'morenn520' has created a pull request for this issue: https://github.com/apache/spark/pull/17872 > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If a Spark application needs to access remote namenodes, > yarn.spark.access.namenodes only needs to be configured in the spark-submit > script, and the Spark client (on YARN) fetches HDFS credentials periodically. > If a Hadoop cluster is configured for HA, there is one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark application fails because accessing the standby > namenode raises org.apache.hadoop.ipc.StandbyException. > I think configuring standby namenodes in yarn.spark.access.namenodes would > have no bad effect, and my Spark application would be able to survive > a Hadoop namenode failover. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark application code: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20613: -- Priority: Major (was: Blocker) > Double quotes in Windows batch script > - > > Key: SPARK-20613 > URL: https://issues.apache.org/jira/browse/SPARK-20613 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.1.1 >Reporter: Jarrett Meyer > > This is a new issue in version 2.1.1. This problem was not present in 2.1.0. > In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the > line that invokes the {{RUNNER}} have quotes. This opens and closes the quote > immediately, producing something like > {code} > RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" >ab c > {code} > The quote above {{a}} opens the quote. The quote above {{b}} closes the > quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarrett Meyer updated SPARK-20613: -- Description: This is a new issue in version 2.1.1. This problem was not present in 2.1.0. In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like {code} RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" ab c {code} The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax. was: In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like {code} RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" ab c {code} The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax. > Double quotes in Windows batch script > - > > Key: SPARK-20613 > URL: https://issues.apache.org/jira/browse/SPARK-20613 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.1.1 >Reporter: Jarrett Meyer >Priority: Blocker > > This is a new issue in version 2.1.1. This problem was not present in 2.1.0. > In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the > line that invokes the {{RUNNER}} have quotes. This opens and closes the quote > immediately, producing something like > {code} > RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" >ab c > {code} > The quote above {{a}} opens the quote. The quote above {{b}} closes the > quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20613: Assignee: (was: Apache Spark) > Double quotes in Windows batch script > - > > Key: SPARK-20613 > URL: https://issues.apache.org/jira/browse/SPARK-20613 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.1.1 >Reporter: Jarrett Meyer >Priority: Blocker > > In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the > line that invokes the {{RUNNER}} have quotes. This opens and closes the quote > immediately, producing something like > {code} > RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" >ab c > {code} > The quote above {{a}} opens the quote. The quote above {{b}} closes the > quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20613: Assignee: Apache Spark > Double quotes in Windows batch script > - > > Key: SPARK-20613 > URL: https://issues.apache.org/jira/browse/SPARK-20613 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.1.1 >Reporter: Jarrett Meyer >Assignee: Apache Spark >Priority: Blocker > > In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the > line that invokes the {{RUNNER}} have quotes. This opens and closes the quote > immediately, producing something like > {code} > RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" >ab c > {code} > The quote above {{a}} opens the quote. The quote above {{b}} closes the > quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20613) Double quotes in Windows batch script
[ https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998160#comment-15998160 ] Apache Spark commented on SPARK-20613: -- User 'jarrettmeyer' has created a pull request for this issue: https://github.com/apache/spark/pull/17861 > Double quotes in Windows batch script > - > > Key: SPARK-20613 > URL: https://issues.apache.org/jira/browse/SPARK-20613 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.1.1 >Reporter: Jarrett Meyer >Priority: Blocker > > In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the > line that invokes the {{RUNNER}} have quotes. This opens and closes the quote > immediately, producing something like > {code} > RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" >ab c > {code} > The quote above {{a}} opens the quote. The quote above {{b}} closes the > quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20613) Double quotes in Windows batch script
Jarrett Meyer created SPARK-20613: - Summary: Double quotes in Windows batch script Key: SPARK-20613 URL: https://issues.apache.org/jira/browse/SPARK-20613 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 2.1.1 Reporter: Jarrett Meyer Priority: Blocker In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the line that invokes the {{RUNNER}} have quotes. This opens and closes the quote immediately, producing something like {code} RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"" ab c {code} The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. This creates a space at position {{c}}, which is invalid syntax. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
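The quoting problem described in this report can be illustrated with a small model. The sketch below is written in Python and assumes a deliberately simplified rule in which every double quote toggles the quoted state; it is not the real Windows command parser, and the function name is hypothetical.

```python
# Simplified model: each double quote toggles the "inside quotes" state.
# This is an assumption for illustration, not cmd.exe's actual parsing rules.

def unquoted_space_positions(command):
    """Return the indexes of spaces that fall outside double-quoted regions."""
    in_quotes = False
    positions = []
    for i, ch in enumerate(command):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == " " and not in_quotes:
            positions.append(i)
    return positions


doubled = r'""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""'
single = r'"C:\Program Files (x86)\Java\jre1.8.0_131\bin\java"'

print(unquoted_space_positions(doubled))
print(unquoted_space_positions(single))
```

Under this model, the doubled quotes open and close immediately, so the space after "Program" is left unprotected, while the singly quoted path has no exposed spaces.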
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998150#comment-15998150 ] Steve Loughran commented on SPARK-20608: Probably good to pull in someone who understands HDFS HA; I nominate [~liuml07]. My main worry is that RemoteException could be a symptom of something more serious than the node being in standby, but I don't know enough about NN HA for my opinions to be trusted. > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If a Spark application needs to access remote namenodes, > yarn.spark.access.namenodes only needs to be configured in the spark-submit > script, and the Spark client (on YARN) fetches HDFS credentials periodically. > If a Hadoop cluster is configured for HA, there is one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark application fails because accessing the standby > namenode raises org.apache.hadoop.ipc.StandbyException. > I think configuring standby namenodes in yarn.spark.access.namenodes would > have no bad effect, and my Spark application would be able to survive > a Hadoop namenode failover. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark application code: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20546: - Assignee: Jessie Yu > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Assignee: Jessie Yu >Priority: Minor > Fix For: 2.1.2, 2.2.1 > > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20546. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.1 Issue resolved by pull request 17852 [https://github.com/apache/spark/pull/17852] > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Priority: Minor > Fix For: 2.2.1, 2.1.2 > > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
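Note on the fix suggested in the report above: a minimal Bash sketch of "turn off posix mode before the while loop", not the actual spark-class source. The build_command function here is a hypothetical stand-in that only emits NUL-terminated arguments, and the sample arguments are made up for illustration. Some Bash versions reject process substitution while posix mode is on, which is why the sketch switches it off first.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for spark-class's build_command:
# prints each argument followed by a NUL byte.
build_command() {
  printf '%s\0' "$@"
}

# Reproduce the environment from the report: allexport plus posix mode.
set -a -o posix

# Suggested fix: leave posix mode before the loop that relies on
# process substitution, so `done < <(...)` is parsed as intended.
set +o posix

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "java" "-cp" "a b.jar" "Main")

# Print how many arguments were collected.
printf '%s\n' "${#CMD[@]}"
```

Because the arguments are NUL-delimited, an argument containing spaces (such as "a b.jar" above) stays a single array element.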
[jira] [Commented] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998050#comment-15998050 ] Sean Owen commented on SPARK-20611: --- No, there's not necessarily any problem in Spark. The Logging trait changed over versions of Spark -- not in CDH -- and if you don't match the versions correctly, this internal API may not be compatible across releases, because it's not an external API. For example you generally use Spark 2 in CDH 5.10 but you are targeting 1.6. CDH doesn't support the Kinesis connector, though it may happen to work. This is in any event not an issue for Spark. > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuechen Chen updated SPARK-20608: - Description: If one Spark Application need to access remote namenodes, yarn.spark.access.namenodes should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if yarn.spark.access.namenodes includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in yarn.spark.access.namenodes, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application Codes: dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) was: If one Spark Application need to access remote namenodes, {yarn.spark.access.namenodes} should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if {yarn.spark.access.namenodes} includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in {yarn.spark.access.namenodes}, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application Codes: dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath) > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > yarn.spark.access.namenodes should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > yarn.spark.access.namenodes, and my Spark Application can be able to sustain > the failover of Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception
[ https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20612: Assignee: Apache Spark > Unresolvable attribute in Filter won't throw analysis exception > --- > > Key: SPARK-20612 > URL: https://issues.apache.org/jira/browse/SPARK-20612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We have a rule in Analyzer that adds missing attributes in a Filter into its > child plan. It makes the following codes work: > {code} > val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y") > df.select("y").where("x=1") > {code} > It should throw an analysis exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception
[ https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20612: Assignee: (was: Apache Spark) > Unresolvable attribute in Filter won't throw analysis exception > --- > > Key: SPARK-20612 > URL: https://issues.apache.org/jira/browse/SPARK-20612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > We have a rule in Analyzer that adds missing attributes in a Filter into its > child plan. It makes the following codes work: > {code} > val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y") > df.select("y").where("x=1") > {code} > It should throw an analysis exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception
[ https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997993#comment-15997993 ] Apache Spark commented on SPARK-20612: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17871 > Unresolvable attribute in Filter won't throw analysis exception > --- > > Key: SPARK-20612 > URL: https://issues.apache.org/jira/browse/SPARK-20612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > We have a rule in Analyzer that adds missing attributes in a Filter into its > child plan. It makes the following codes work: > {code} > val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y") > df.select("y").where("x=1") > {code} > It should throw an analysis exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987 ] sumit edited comment on SPARK-20611 at 5/5/17 9:32 AM: --- Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as per Spark external (https://github.com/apache/spark/tree/master/external), it would get fixed here. I am running against the same version of Spark, which is 1.6. The issue is that the CDH distribution has modified the internal Spark class for Logging; please see https://issues.apache.org/jira/browse/LEGAL-198 was (Author: sumitkumarkarn): Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as per Spark external (https://github.com/apache/spark/tree/master/external), it would get fixed here. I am running against the same version of Spark, which is 1.6. The issue is that the CDH distribution has modified the internal Spark class for Logging > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987 ] sumit edited comment on SPARK-20611 at 5/5/17 9:29 AM: --- Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as per Spark external (https://github.com/apache/spark/tree/master/external), it would get fixed here. I am running against the same version of Spark, which is 1.6. The issue is that the CDH distribution has modified the internal Spark class for Logging was (Author: sumitkumarkarn): Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as per Spark external (https://github.com/apache/spark/tree/master/external), it would get fixed here. > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987 ] sumit commented on SPARK-20611: --- Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as per Spark external (https://github.com/apache/spark/tree/master/external), it would get fixed here. > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception
Liang-Chi Hsieh created SPARK-20612: --- Summary: Unresolvable attribute in Filter won't throw analysis exception Key: SPARK-20612 URL: https://issues.apache.org/jira/browse/SPARK-20612 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Liang-Chi Hsieh We have a rule in Analyzer that adds missing attributes in a Filter into its child plan. It makes the following codes work: {code} val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y") df.select("y").where("x=1") {code} It should throw an analysis exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20611. --- Resolution: Not A Problem If a question is specific to CDH, it doesn't belong here, but rather at Cloudera. No, it doesn't actually fix anything to duplicate the Logging trait. We do not use patches. You should read http://spark.apache.org/contributing.html The problem is version mismatch. I don't think you have built vs the same version of Spark that you run against. > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-20611. - > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.6.0</version> > </dependency> > <dependency> > <groupId>com.amazonaws</groupId> > <artifactId>amazon-kinesis-client</artifactId> > <version>1.6.1</version> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-streaming-kinesis-asl_2.10</artifactId> > <version>1.6.0</version> > </dependency> -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20608: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > {yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if {yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > {yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997948#comment-15997948 ]

Sean Owen commented on SPARK-20608:
-----------------------------------

CC [~vanzin] [~ste...@apache.org]

> Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20608
>                 URL: https://issues.apache.org/jira/browse/SPARK-20608
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit, YARN
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, {yarn.spark.access.namenodes} only needs to be configured in the spark-submit script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at least one standby namenode.
> However, if {yarn.spark.access.namenodes} includes both active and standby namenodes, the Spark application fails, because access to the standby namenode is rejected with org.apache.hadoop.ipc.StandbyException.
> I think configuring standby namenodes in {yarn.spark.access.namenodes} would do no harm, and it would let my Spark application survive a failover of the Hadoop namenode.
> HA example:
> spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark application code: dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration
[ https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997946#comment-15997946 ]

Sean Owen commented on SPARK-20472:
-----------------------------------

JVM config matters. How do you change the driver heap size in client mode after startup? What are the semantics of changing a batch size at runtime? Or a cache size? It raises a lot of questions, so no, this is not generally possible.

> Support for Dynamic Configuration
> ---------------------------------
>
>                 Key: SPARK-20472
>                 URL: https://issues.apache.org/jira/browse/SPARK-20472
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.1.0
>            Reporter: Shahbaz Hussain
>
> Currently, Spark configuration cannot be changed dynamically.
> The Spark job has to be killed and started again for a new configuration to take effect.
> This issue is to enhance Spark so that configuration changes can be applied dynamically without requiring an application restart.
> Ex: if the batch interval in a streaming job is 20 seconds and the user wants to reduce it to 5 seconds, this currently requires resubmitting the job.
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuechen Chen updated SPARK-20608: - Description: If one Spark Application need to access remote namenodes, {yarn.spark.access.namenodes} should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if {yarn.spark.access.namenodes} includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in {yarn.spark.access.namenodes}, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application Codes: dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) was: If one Spark Application need to access remote namenodes, ${yarn.spark.access.namenodes} should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if ${yarn.spark.access.namenodes} includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in ${yarn.spark.access.namenodes}, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application Codes: dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath) > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > {yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if {yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > {yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20608: Assignee: Apache Spark > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Assignee: Apache Spark > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > ${yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if ${yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > ${yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuechen Chen updated SPARK-20608: - Description: If one Spark Application need to access remote namenodes, ${yarn.spark.access.namenodes} should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if ${yarn.spark.access.namenodes} includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in ${yarn.spark.access.namenodes}, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: Spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application Codes: dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) was: If one Spark Application need to access remote namenodes, ${yarn.spark.access.namenodes} should be only be configged in spark-submit scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. If one hadoop cluster is configured by HA, there would be one active namenode and at least one standby namenode. However, if ${yarn.spark.access.namenodes} includes both active and standby namenodes, Spark Application will be failed for the reason that the standby namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. I think it won't cause any bad effect to config standby namenodes in ${yarn.spark.access.namenodes}, and my Spark Application can be able to sustain the failover of Hadoop namenode. HA Examples: spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application: dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath) > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > ${yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if ${yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > ${yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20608: Assignee: (was: Apache Spark) > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > ${yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if ${yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > ${yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997900#comment-15997900 ] Apache Spark commented on SPARK-20608: -- User 'morenn520' has created a pull request for this issue: https://github.com/apache/spark/pull/17870 > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > ${yarn.spark.access.namenodes} should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if ${yarn.spark.access.namenodes} includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > ${yarn.spark.access.namenodes}, and my Spark Application can be able to > sustain the failover of Hadoop namenode. > HA Examples: > spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sumit updated SPARK-20611: -- Summary: Spark kinesis connector doesnt work with cloudera distribution (was: Spark kinesis connector doesn work with cloudera distribution) > Spark kinesis connector doesnt work with cloudera distribution > --- > > Key: SPARK-20611 > URL: https://issues.apache.org/jira/browse/SPARK-20611 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: sumit > Labels: cloudera > Attachments: spark-kcl.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Facing below exception on CDH5.10 > 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 > (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError > at org.apache.spark.Logging$class.log(Logging.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) > at org.apache.spark.Logging$class.logDebug(Logging.scala:62) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.(KinesisCheckpointer.scala:50) > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) > at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at 
org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > below is my POM file > > org.apache.spark > spark-streaming_2.10 > 1.6.0 > > > org.apache.spark > spark-core_2.10 > 1.6.0 > > > com.amazonaws > amazon-kinesis-client > 1.6.1 > > > org.apache.spark > spark-streaming-kinesis-asl_2.10 > 1.6.0 > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997889#comment-15997889 ]

sumit commented on SPARK-20611:
-------------------------------

Please evaluate and review the patch file. If it looks good, I would like to submit a PR for it. Thanks.

> Spark kinesis connector doesn work with cloudera distribution
> --------------------------------------------------------------
>
>                 Key: SPARK-20611
>                 URL: https://issues.apache.org/jira/browse/SPARK-20611
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: sumit
>              Labels: cloudera
>         Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing the exception below on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
>     at org.apache.spark.Logging$class.log(Logging.scala:50)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
>     at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
>     at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
[jira] [Updated] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution
[ https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sumit updated SPARK-20611:
--------------------------
    Attachment: spark-kcl.patch

The attached patch does exactly what was done in the past for the Cassandra connector; see the external link on this ticket, i.e. https://datastax-oss.atlassian.net/browse/SPARKC-460

> Spark kinesis connector doesn work with cloudera distribution
> --------------------------------------------------------------
>
>                 Key: SPARK-20611
>                 URL: https://issues.apache.org/jira/browse/SPARK-20611
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: sumit
>              Labels: cloudera
>         Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing the exception below on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
>     at org.apache.spark.Logging$class.log(Logging.scala:50)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
>     at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
>     at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
>     at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
[jira] [Created] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution
sumit created SPARK-20611: - Summary: Spark kinesis connector doesn work with cloudera distribution Key: SPARK-20611 URL: https://issues.apache.org/jira/browse/SPARK-20611 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: sumit Facing below exception on CDH5.10 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError at org.apache.spark.Logging$class.log(Logging.scala:50) at org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39) at org.apache.spark.Logging$class.logDebug(Logging.scala:62) at org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39) at org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119) at org.apache.spark.streaming.kinesis.KinesisCheckpointer.(KinesisCheckpointer.scala:50) at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) below is my POM file org.apache.spark spark-streaming_2.10 1.6.0 org.apache.spark spark-core_2.10 1.6.0 com.amazonaws amazon-kinesis-client 1.6.1 org.apache.spark spark-streaming-kinesis-asl_2.10 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20610) Support a function get DataFrame/DataSet from Transformer
[ https://issues.apache.org/jira/browse/SPARK-20610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

darion yaphet closed SPARK-20610.
---------------------------------
    Resolution: Won't Fix

> Support a function get DataFrame/DataSet from Transformer
> ---------------------------------------------------------
>
>                 Key: SPARK-20610
>                 URL: https://issues.apache.org/jira/browse/SPARK-20610
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: darion yaphet
>
> We are using stages to build our machine learning pipeline. A Transformer transforms an input dataset into an output dataset, which is our DataFrame.
> Sometimes we want to test the DataFrame's result while developing the pipeline, but it looks difficult to run such a test. If Spark ML stages supported an interface to explore the DataFrame processed by a stage, we could use it to run tests.
[jira] [Created] (SPARK-20610) Support a function get DataFrame/DataSet from Transformer
darion yaphet created SPARK-20610:
-------------------------------------

             Summary: Support a function get DataFrame/DataSet from Transformer
                 Key: SPARK-20610
                 URL: https://issues.apache.org/jira/browse/SPARK-20610
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.1.0, 2.0.2
            Reporter: darion yaphet

We are using stages to build our machine learning pipeline. A Transformer transforms an input dataset into an output dataset, which is our DataFrame.
Sometimes we want to test the DataFrame's result while developing the pipeline, but it looks difficult to run such a test. If Spark ML stages supported an interface to explore the DataFrame processed by a stage, we could use it to run tests.
[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration
[ https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997874#comment-15997874 ]

Shahbaz Hussain commented on SPARK-20472:
-----------------------------------------

Yes, the idea is to have a way to persist configuration in memory, e.g. batch interval, SQL shuffle partitions, etc.; these are primarily Spark-specific configurations.
JVM configuration is global and can't be changed. This request is not for dynamic configuration of the JVM but of Spark-application-specific settings.

> Support for Dynamic Configuration
> ---------------------------------
>
>                 Key: SPARK-20472
>                 URL: https://issues.apache.org/jira/browse/SPARK-20472
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.1.0
>            Reporter: Shahbaz Hussain
>
> Currently, Spark configuration cannot be changed dynamically.
> The Spark job has to be killed and started again for a new configuration to take effect.
> This issue is to enhance Spark so that configuration changes can be applied dynamically without requiring an application restart.
> Ex: if the batch interval in a streaming job is 20 seconds and the user wants to reduce it to 5 seconds, this currently requires resubmitting the job.
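The distinction drawn in this thread, between JVM-level settings fixed at startup and application-level settings that could in principle change while running, can be illustrated with a small plain-Python sketch. Everything here is hypothetical: `AppConfig`, `StaticConfigError` and the hand-picked `STATIC_KEYS` set are illustration only and do not describe how Spark stores or validates its configuration.

```python
class StaticConfigError(Exception):
    """Raised when a startup-only setting is changed after startup."""


class AppConfig:
    # Assumption for illustration: the set of startup-only keys is chosen
    # by hand here; it is not derived from Spark itself.
    STATIC_KEYS = frozenset({"spark.driver.memory"})

    def __init__(self, initial):
        self._values = dict(initial)
        self._started = False

    def start(self):
        """Mark the application as running; static keys are frozen from now on."""
        self._started = True

    def set(self, key, value):
        if self._started and key in self.STATIC_KEYS:
            raise StaticConfigError(key + " cannot change after startup")
        self._values[key] = value

    def get(self, key):
        return self._values[key]


conf = AppConfig({
    "spark.driver.memory": "4g",
    "spark.sql.shuffle.partitions": "200",
})
conf.start()

# An application-level setting is updated in memory while "running".
conf.set("spark.sql.shuffle.partitions", "50")

# A JVM-level setting is rejected once the application has started.
try:
    conf.set("spark.driver.memory", "8g")
    rejected = False
except StaticConfigError:
    rejected = True
```

The sketch deliberately leaves open the harder question raised earlier in the thread, namely what the runtime semantics of changing a value such as a batch interval should be once work is already in flight.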
[jira] [Closed] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics
[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li closed SPARK-20545.
---------------------------
    Resolution: Cannot Reproduce

> union set operator should default to DISTINCT and not ALL semantics
> -------------------------------------------------------------------
>
>                 Key: SPARK-20545
>                 URL: https://issues.apache.org/jira/browse/SPARK-20545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: N Campbell
>
> A set operation (e.g. union) over two queries that produce identical row values should return the distinct set of rows, not all rows.
> ISO SQL set operation semantics default to DISTINCT.
> The Spark implementation defaults to ALL.
> While Spark allows the DISTINCT keyword and some might assume ALL is faster, the result set produced is semantically wrong per the standard (and per commercial SQL systems including ORACLE, DB2, Teradata, SQL Server etc.).
> select tsint.csint from cert.tsint
> union
> select tint.cint from cert.tint
> csint
>
> -1
> 0
> 1
> 10
>
> -1
> 0
> 1
> 10
> vs
> select tsint.csint from cert.tsint union distinct select tint.cint from cert.tint
> csint
> -1
>
> 1
> 10
> 0
[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics
[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997845#comment-15997845 ]

Xiao Li commented on SPARK-20545:
---------------------------------

Please reopen it if you still hit this issue. Thanks!
[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics
[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997843#comment-15997843 ]

Xiao Li commented on SPARK-20545:
---------------------------------

You can try
{noformat}
select 3 as `col` union select 3 as `col`
{noformat}
It outputs a single row with the value 3. In Spark SQL, if neither ALL nor DISTINCT is used, DISTINCT behavior is the default.
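For readers who want to see the difference between the two set semantics discussed in SPARK-20545, here is a minimal sketch. It does not run Spark: it uses Python's built-in sqlite3 module as a stand-in SQL engine, so it only illustrates the standard behavior (plain UNION removes duplicates, UNION ALL keeps them), not what any particular Spark version does. The table contents are an assumption based on the values shown in the report, with None standing in for the blank row, and the `cert` schema prefix is dropped.

```python
import sqlite3

# Assumed sample data mirroring the values in the report;
# None stands in for the blank (presumably NULL) row.
VALUES = [None, -1, 0, 1, 10]


def load_tables(conn):
    """Create the two tables from the report (without the cert schema) and fill them."""
    conn.execute("CREATE TABLE tsint (csint SMALLINT)")
    conn.execute("CREATE TABLE tint (cint INT)")
    conn.executemany("INSERT INTO tsint VALUES (?)", [(v,) for v in VALUES])
    conn.executemany("INSERT INTO tint VALUES (?)", [(v,) for v in VALUES])


def fetch_column(conn, query):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
load_tables(conn)

# Plain UNION: duplicates (including the duplicate NULL) are removed.
union_rows = fetch_column(
    conn, "SELECT csint FROM tsint UNION SELECT cint FROM tint")

# UNION ALL: every row from both inputs is kept.
union_all_rows = fetch_column(
    conn, "SELECT csint FROM tsint UNION ALL SELECT cint FROM tint")

# The literal query from the comment above.
literal_rows = fetch_column(conn, "SELECT 3 AS col UNION SELECT 3 AS col")
conn.close()

print(len(union_rows), len(union_all_rows), literal_rows)  # 5 10 [3]
```

With this data, plain UNION yields 5 rows and UNION ALL yields 10, which matches the shape of the two result sets quoted in the report. Running the same three queries through `spark.sql(...)` would be the way to check the behavior on an actual Spark installation.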