[jira] [Commented] (SPARK-20543) R should skip long running or non-essential tests when running on CRAN

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999262#comment-15999262
 ] 

Apache Spark commented on SPARK-20543:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17878

> R should skip long running or non-essential tests when running on CRAN
> --
>
> Key: SPARK-20543
> URL: https://issues.apache.org/jira/browse/SPARK-20543
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0, 2.3.0
>
>
> This is actually recommended in the CRAN policies



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor

2017-05-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20614.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Use the same log4j configuration with Jenkins in AppVeyor
> -
>
> Key: SPARK-20614
> URL: https://issues.apache.org/jira/browse/SPARK-20614
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Currently, the AppVeyor console is flooded with logs. This has been 
> acceptable because we can download all the logs. However, (given my 
> observations so far) the logs are truncated when there are too many.
> For example, see  
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
> Even after the log is downloaded, it looks truncated as below:
> {code}
> [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in 
> stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 
> (TID 9213)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 
> 601.0 (TID 9212). 2473 bytes result sent to driver
> {code}
> It would probably be better to use the same log4j configuration that we are 
> using for Jenkins.
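
A minimal sketch of that idea in log4j 1.x properties syntax (the appender name 
and file path below are illustrative only; the actual configuration used on 
Jenkins may differ):

{code}
# Keep the console quiet and send the detailed logs to a file instead.
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
{code}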



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20520) R streaming tests failed on Windows

2017-05-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999252#comment-15999252
 ] 

Felix Cheung commented on SPARK-20520:
--

Waiting for the next RC to try it with the fix for SPARK-20571.

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", 
> capture.output(lastProgress(q)))) isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello, I encountered a filtering bug using 'isin' in PySpark SQL on version 
2.2.0, Ubuntu 16.04.

Enclosed below is an example to replicate it:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows that null entries in col1 are considered 'isin' the list ["a"] 
# (null is not in the list, so those rows should show):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add an OR isNull condition:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you
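
For context: this matches SQL three-valued logic, where NULL IN (...) evaluates 
to NULL rather than FALSE, so ~isin() is also NULL on the null rows and filter() 
only keeps rows where the predicate is true. A minimal sketch that folds 
workaround 1 into a single predicate (it assumes the same spark session and 
test_sdf as above):

from pyspark.sql import functions as sf

# Null rows make ~isin() evaluate to NULL, which filter() drops, so the
# isNull() branch re-admits them explicitly.
kept = test_sdf.filter(sf.col("col1").isNull() | ~sf.col("col1").isin(["a"]))
kept.show()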


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"] (it is not in the list 
so it should show):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null entries in col1 are considered 'isin' the list ["a"] (it 
> is not in the list so it should show):
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> 

[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"] (it is not in the list 
so it should show):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"] (it is not in the list 
> so it should show):
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  

[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:
from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:
from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import functions as sf
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Summary: pyspark.sql,  filtering with ~isin missing rows  (was: 
pyspark.sql,  ~isin when columns contain null (missing rows))

> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, ~isin when columns contain null (missing rows)

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Summary: pyspark.sql,  ~isin when columns contain null (missing rows)  
(was: pyspark.sql,  isin when columns contain null)

> pyspark.sql,  ~isin when columns contain null (missing rows)
> 
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  isin when columns contain null
> 
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

# Got:
 |col1|col2|
 |   b|   3|
 |   c|   4|

# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  isin when columns contain null
> 
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

# Got:
 |col1|col2|
 |   b|   3|
 |   c|   4|

# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 +----+----+
 |col1|col2|
 +----+----+
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|
 +----+----+


# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   b|   3|
# |   c|   4|
# +----+----+

# Got:
# +----+----+
# |col1|col2|
# +----+----+
# |   b|   3|
# |   c|   4|
# +----+----+


# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

# +----+----+----+
# |col1|col2|isin|
# +----+----+----+
# |null|   0|null|
# |null|   1|null|
# |   c|   4|null|
# |   b|   3|null|
# +----+----+----+




> pyspark.sql,  isin when columns contain null
> 
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> #Expecting
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> # Got:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> # My workarounds:
> # 1.
> # null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> # 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, isin when columns contain null

2017-05-05 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 +----+----+
 |col1|col2|
 +----+----+
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|
 +----+----+


# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   b|   3|
# |   c|   4|
# +----+----+

# Got:
# +----+----+
# |col1|col2|
# +----+----+
# |   b|   3|
# |   c|   4|
# +----+----+


# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

# +----+----+----+
# |col1|col2|isin|
# +----+----+----+
# |null|   0|null|
# |null|   1|null|
# |   c|   4|null|
# |   b|   3|null|
# +----+----+----+



  was:
Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   a|   2|
# |   b|   3|
# |   c|   4|
# +----+----+


# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   b|   3|
# |   c|   4|
# +----+----+

# Got:
# +----+----+
# |col1|col2|
# +----+----+
# |   b|   3|
# |   c|   4|
# +----+----+


# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

# +----+----+----+
# |col1|col2|isin|
# +----+----+----+
# |null|   0|null|
# |null|   1|null|
# |   c|   4|null|
# |   b|   3|null|
# +----+----+----+




> pyspark.sql,  isin when columns contain null
> 
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> import pandas as pd
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  +----+----+
>  |col1|col2|
>  +----+----+
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
>  +----+----+
> # Below shows null is considered 'isin' the list ["a"]:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> #Expecting
> # +----+----+
> # |col1|col2|
> # +----+----+
> # |null|   0|
> # |null|   1|
> # |   b|   3|
> # |   c|   4|
> # +----+----+
> # Got:
> # +----+----+
> # |col1|col2|
> # +----+----+
> # |   b|   3|
> # |   c|   4|
> # +----+----+
> # My workarounds:
> # 1.
> # null is considered 'in', so add OR isNull conditon!
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> # 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> # +----+----+----+
> # |col1|col2|isin|
> # +----+----+----+
> # 

[jira] [Created] (SPARK-20617) pyspark.sql, isin when columns contain null

2017-05-05 Thread Ed Lee (JIRA)
Ed Lee created SPARK-20617:
--

 Summary: pyspark.sql,  isin when columns contain null
 Key: SPARK-20617
 URL: https://issues.apache.org/jira/browse/SPARK-20617
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.2.0
 Environment: Ubuntu Xenial 16.04
Reporter: Ed Lee


Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, 
Ubuntu 16.04.

Enclosed below an example to replicate:

import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   a|   2|
# |   b|   3|
# |   c|   4|
# +----+----+


# Below shows null is considered 'isin' the list ["a"]:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

#Expecting
# +----+----+
# |col1|col2|
# +----+----+
# |null|   0|
# |null|   1|
# |   b|   3|
# |   c|   4|
# +----+----+

# Got:
# +----+----+
# |col1|col2|
# +----+----+
# |   b|   3|
# |   c|   4|
# +----+----+


# My workarounds:

# 1.
# null is considered 'in', so add OR isNull conditon!
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

# 2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

# +----+----+----+
# |col1|col2|isin|
# +----+----+----+
# |null|   0|null|
# |null|   1|null|
# |   c|   4|null|
# |   b|   3|null|
# +----+----+----+





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20616.
-
   Resolution: Fixed
 Assignee: Juliusz Sompolski
Fix Version/s: 2.2.0
   2.1.2

> RuleExecutor logDebug of batch results should show diff to start of batch
> -
>
> Key: SPARK-20616
> URL: https://issues.apache.org/jira/browse/SPARK-20616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.1.2, 2.2.0
>
>
> Due to a likely typo, the logDebug message printing the diff of query plans 
> shows a diff to the initial plan, not a diff to the start of the batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-05-05 Thread Abhishek Madav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999016#comment-15999016
 ] 

Abhishek Madav commented on SPARK-19532:


I am running into this issue, where a code path similar to hiveWriterContainer is 
trying to write to the HDFS location. I tried setting spark.speculation to false, 
but that doesn't seem to be the issue. Is there any workaround? This wait time 
makes the job run very slowly. 



> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set 
> `spark.speculation` to true
> 
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> When set `spark.speculation` to true, from thread dump page of Executor of 
> WebUI, I found that there are about 1300 threads named  "DataStreamer for 
> file 
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
>  in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> Off-heap memory usage grows substantially until the Executor exits with an OOM 
> exception. This problem occurs only when writing data to Hadoop (tasks may be 
> killed by the Executor during writing).
> Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? 
> The version of Hadoop is 2.6.4.
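
For reference, spark.speculation is a core scheduler setting that is read when 
the SparkContext starts, so it generally has to be supplied at launch time 
(e.g. --conf spark.speculation=false) rather than changed on a running session. 
A minimal PySpark sketch, with a made-up application name, and with no claim 
that this avoids the thread leak described above:

{code}
from pyspark.sql import SparkSession

# Pass the setting before the session (and its SparkContext) is created.
spark = (SparkSession.builder
         .appName("no-speculation-example")
         .config("spark.speculation", "false")
         .getOrCreate())
{code}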



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20615:


Assignee: (was: Apache Spark)

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfBoundsException when the vector size is greater than zero and no 
> values are defined.  The toString() representation of such a vector is 
> "(10,[],[])".  This is because the argmax function tries to get the value 
> at indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.
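
To make the failure mode concrete, here is a stand-alone sketch of the missing 
guard; sparse_argmax is a hypothetical helper written only for illustration and 
is not Spark's actual implementation, whose eventual fix may differ:

{code}
def sparse_argmax(size, indices, values):
    # Hypothetical helper (not Spark's API): argmax of a sparse vector whose
    # explicitly stored entries are (indices, values) and whose other entries are 0.0.
    if size == 0:
        return -1                 # no valid index exists at all
    if not values:
        return 0                  # e.g. (10,[],[]): every entry is 0.0, so index 0 attains the max
    stored = dict(zip(indices, values))
    max_val = max(values)
    if len(values) < size and max_val < 0:
        max_val = 0.0             # an implicit zero beats every stored (negative) value
    # first index whose explicit or implicit value equals the maximum
    return next(i for i in range(size) if stored.get(i, 0.0) == max_val)

# The reported crash corresponds to the second guard being absent:
print(sparse_argmax(10, [], []))  # 0, instead of reading the first stored index and failing
{code}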



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20615:


Assignee: Apache Spark

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Assignee: Apache Spark
>Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfRangeException when the vector size is greater than zero and no 
> values are defined.  The toString() representation of such a vector is " 
> (10,[],[])".  This is because the argmax function tries to get the value 
> at indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998970#comment-15998970
 ] 

Apache Spark commented on SPARK-20615:
--

User 'jonmclean' has created a pull request for this issue:
https://github.com/apache/spark/pull/17877

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfBoundsException when the vector size is greater than zero and no 
> values are defined.  The toString() representation of such a vector is " 
> (10,[],[])".  This is because the argmax function tries to get the value 
> at indices(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20132) Add documentation for column string functions

2017-05-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20132:
---
Fix Version/s: 2.2.0

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 2.2.0, 2.3.0
>
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2017-05-05 Thread Rupesh Mane (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998900#comment-15998900
 ] 

Rupesh Mane commented on SPARK-18105:
-

I'm facing this issue with Spark 2.1.0 but not with Spark 2.0.2. I'm using AWS 
EMR 5.2.0, which has Spark 2.0.2, and jobs run successfully. With everything else 
the same (code, files to process, settings, etc.), when I use EMR 5.5.0, which has 
Spark 2.1.0, I run into this issue. The stack trace is slightly different (see 
below) and similar to this one, which was fixed in 2013: 
https://github.com/lz4/lz4-java/issues/13. Comparing the LZ4 binary dependency, 
Spark 2.0.2 and Spark 2.1.0 both use LZ4 1.3.0, so I'm confused why it works on 
the older version of Spark. The only difference in directory structure I see is 
that Spark 2.0.2 has the LZ4 libraries in lib but not under python/lib, while 
Spark 2.1.0 has them in both lib and python/lib.


2017-05-05 01:15:50,681 [ERROR  ] schema: Exception raised during Operation: An 
error occurred while calling o104.save.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
0.0 (TID 6, ip-172-31-26-105.ec2.internal, executor 1): java.io.IOException: 
Stream is corrupted
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:163)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2606)
at 
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2622)
at 
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3099)
at 
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
at 

[jira] [Updated] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-05-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19910:
--
Affects Version/s: 2.1.1

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}
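
A hedged workaround sketch until the fix lands (plain Spark SQL, not part of the actual patch): cast the NULL so it carries an explicit IntegerType instead of NullType, which avoids the type-mismatch check.
{code}
// spark-shell sketch: give the NULL an explicit IntegerType so all stack() arguments agree
sql("select stack(3, 1, 2, cast(null as int))").printSchema
// expected, by analogy with the example above:
// root
//  |-- col0: integer (nullable = true)
{code}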



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-05-05 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998771#comment-15998771
 ] 

Dongjoon Hyun commented on SPARK-19910:
---

Hi, [~cloud_fan] and [~smilegator].
Could you review this issue and PR?

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Dongjoon Hyun
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy

2017-05-05 Thread Jeeyoung Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998731#comment-15998731
 ] 

Jeeyoung Kim commented on SPARK-10878:
--

[~joshrosen] Yes, I realized what the potential race conditions are (both inside 
Ivy and in how Spark uses Ivy). Regarding (1), even if Ivy becomes thread-safe, 
writing a temporary pom file with a fixed filename would still break things - so I 
think this is a valuable thing to do. I can attempt a patch around this.

Regarding (2), I think having multiple resolution caches to get around this is 
quite an inefficient solution. My cache directory is half a gigabyte right now, 
and having that per Spark job seems wasteful.
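
A minimal sketch of the unique-filename idea from (1), under the assumption that each spark-submit invocation just needs its own throwaway pom (illustrative only, not the actual SparkSubmitUtils code):
{code}
import java.nio.file.Files

// Each concurrent spark-submit gets its own temporary pom, so parallel dependency
// resolutions no longer clobber a shared fixed-name file.
val pomFile = Files.createTempFile("spark-submit-parent-", ".pom").toFile
pomFile.deleteOnExit()
println(s"writing dependency pom to ${pomFile.getAbsolutePath}")
{code}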

> Race condition when resolving Maven coordinates via Ivy
> ---
>
> Key: SPARK-10878
> URL: https://issues.apache.org/jira/browse/SPARK-10878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I've recently been shell-scripting the creation of many concurrent 
> Spark-on-YARN apps and observing a fraction of them to fail with what I'm 
> guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the 
> following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with 
> errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
> confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during 
> retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
> failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
> at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
> at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
> ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
> at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them 
> seem to fail with this or similar errors; a batch of ~10 will usually work 
> fine, a batch of 15 will see a few failures, and a batch of ~60 will have 
> dozens of failures.
> [This gist shows 11 recent failures I 
> observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20603) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial offset with Spark 2.1.0

2017-05-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20603:
-
Affects Version/s: 2.1.1
   2.1.0

> Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial 
> offset with Spark 2.1.0
> --
>
> Key: SPARK-20603
> URL: https://issues.apache.org/jira/browse/SPARK-20603
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> This test is flaky. This is the recent failure: 
> https://spark-tests.appspot.com/builds/spark-branch-2.2-test-maven-hadoop-2.7/47



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20603) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial offset with Spark 2.1.0

2017-05-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20603.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite deserialization of initial 
> offset with Spark 2.1.0
> --
>
> Key: SPARK-20603
> URL: https://issues.apache.org/jira/browse/SPARK-20603
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> This test is flaky. This is the recent failure: 
> https://spark-tests.appspot.com/builds/spark-branch-2.2-test-maven-hadoop-2.7/47



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20569:


Assignee: (was: Apache Spark)

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Priority: Trivial
>
> >select  Nvl(null,'1',3);
> >3
> The function of "Nvl" has Only two  input parameters,so, when input three 
> parameters, i think it should notice that:"Error in query: Invalid number of 
> arguments for function nvl".
> Such as "nvl2", "nullIf","IfNull",these have a similar problem



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20569:


Assignee: Apache Spark

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Trivial
>
> >select  Nvl(null,'1',3);
> >3
> The function of "Nvl" has Only two  input parameters,so, when input three 
> parameters, i think it should notice that:"Error in query: Invalid number of 
> arguments for function nvl".
> Such as "nvl2", "nullIf","IfNull",these have a similar problem



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998690#comment-15998690
 ] 

Apache Spark commented on SPARK-20569:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17876

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Priority: Trivial
>
> >select  Nvl(null,'1',3);
> >3
> The function of "Nvl" has Only two  input parameters,so, when input three 
> parameters, i think it should notice that:"Error in query: Invalid number of 
> arguments for function nvl".
> Such as "nvl2", "nullIf","IfNull",these have a similar problem



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-05 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998680#comment-15998680
 ] 

Burak Yavuz commented on SPARK-20571:
-

Thanks!

> Flaky SparkR StructuredStreaming tests
> --
>
> Key: SPARK-20571
> URL: https://issues.apache.org/jira/browse/SPARK-20571
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Felix Cheung
> Fix For: 2.2.0, 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-05-05 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998654#comment-15998654
 ] 

Shixiong Zhu commented on SPARK-18971:
--

[~tgraves] No, as far as I known. But since Spark 2.2.0 has not yet been 
released, not sure how many people tested master or branch-2.2.

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see a stack trace similar to the following in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-05-05 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998654#comment-15998654
 ] 

Shixiong Zhu edited comment on SPARK-18971 at 5/5/17 5:49 PM:
--

[~tgraves] No, as far as I know. But since Spark 2.2.0 has not yet been 
released, not sure how many people tested master or branch-2.2.


was (Author: zsxwing):
[~tgraves] No, as far as I known. But since Spark 2.2.0 has not yet been 
released, not sure how many people tested master or branch-2.2.

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see a stack trace similar to the following in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20616:


Assignee: Apache Spark

> RuleExecutor logDebug of batch results should show diff to start of batch
> -
>
> Key: SPARK-20616
> URL: https://issues.apache.org/jira/browse/SPARK-20616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>
> Due to a likely typo, the logDebug msg printing the diff of query plans shows 
> a diff against the initial plan, not against the plan at the start of the batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20616:


Assignee: (was: Apache Spark)

> RuleExecutor logDebug of batch results should show diff to start of batch
> -
>
> Key: SPARK-20616
> URL: https://issues.apache.org/jira/browse/SPARK-20616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>
> Due to a likely typo, the logDebug msg printing the diff of query plans shows 
> a diff against the initial plan, not against the plan at the start of the batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998639#comment-15998639
 ] 

Apache Spark commented on SPARK-20616:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/17875

> RuleExecutor logDebug of batch results should show diff to start of batch
> -
>
> Key: SPARK-20616
> URL: https://issues.apache.org/jira/browse/SPARK-20616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>
> Due to a likely typo, the logDebug msg printing the diff of query plans shows 
> a diff against the initial plan, not against the plan at the start of the batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20616:
-

 Summary: RuleExecutor logDebug of batch results should show diff 
to start of batch
 Key: SPARK-20616
 URL: https://issues.apache.org/jira/browse/SPARK-20616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski


Due to a likely typo, the logDebug msg printing the diff of query plans shows a 
diff against the initial plan, not against the plan at the start of the batch.
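
A self-contained illustration of the intended behavior (plain Scala, not the actual RuleExecutor code): the per-batch debug diff should compare against the plan as it was when the batch started, not against the plan that originally entered the executor.
{code}
def logBatchResult(initialPlan: String, batchStartPlan: String, resultPlan: String): Unit = {
  if (batchStartPlan != resultPlan) {
    // correct: diff against the snapshot taken when this batch started
    println(s"=== batch changed the plan ===\nbefore: $batchStartPlan\nafter : $resultPlan")
  }
  // diffing resultPlan against initialPlan here instead would reproduce the reported typo
}
{code}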



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-05 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Priority: Minor  (was: Major)

> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>Priority: Minor
>
> When we used more than 2000 executors in a Spark application, we noticed that a 
> large number of executors could not connect to the driver and as a result were 
> marked as failed. In some cases, the number of failed executors reached twice 
> the requested executor count, so applications retried and could eventually 
> fail.
> This is because YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve Spark props and register with the driver within seconds. However, the 
> driver handles executor registration, stop, removal and Spark props retrieval 
> in one thread, and it cannot handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve Spark props 
> and/or register. They are then marked as failed, causing 
> executor removal and aggravating the overload on the driver, which leads to 
> more executor failures. 
> This patch adds an extra configuration, 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximum 
> number of containers the driver can ask for in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily, the number of executor failures is reduced, and 
> applications reach the desired number of executors faster.
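
A hedged sketch of how the proposed cap would be set; the second key comes from this patch and is not part of any released Spark version, and the value 200 is purely illustrative.
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "3000")     // existing setting, default 3s
  .set("spark.yarn.launchContainer.count.simultaneously", "200") // key proposed by this JIRA; illustrative value
{code}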



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20381) ObjectHashAggregateExec is missing numOutputRows

2017-05-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20381.
-
   Resolution: Fixed
 Assignee: yucai
Fix Version/s: 2.2.0

> ObjectHashAggregateExec is missing numOutputRows
> 
>
> Key: SPARK-20381
> URL: https://issues.apache.org/jira/browse/SPARK-20381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: yucai
>Assignee: yucai
> Fix For: 2.2.0
>
>
> Add SQL metrics of numOutputRows for ObjectHashAggregateExec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998544#comment-15998544
 ] 

Marcelo Vanzin commented on SPARK-20608:


Doesn't it work if you add the namespace (not the NN addresses) in the config 
instead?

e.g. {{hdfs://somenamespace}} instead of explicitly calling out the active and 
standby addresses. (That requires hdfs-site.xml to contain the namespace to 
namenode mappings, but that's generally how HA works anyway.)

The problem I see with the patch is that the fact that you're catching 
{{StandbyException}} probably means a token is not being generated for the 
standby. So when it actually becomes active, things will fail because Spark 
doesn't have the right token to talk to it.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only has to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark due to org.apache.hadoop.ipc.StandbyException.
> I think configuring standby namenodes in yarn.spark.access.namenodes would not 
> cause any bad effect, and it would let my Spark application survive 
> a Hadoop namenode failover.
> HA example:
> spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark application code:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-05-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998531#comment-15998531
 ] 

Thomas Graves commented on SPARK-18971:
---

[~zsxwing]have you seen any issues with the new netty version?  We have hit a 
similar issue?

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see a stack trace similar to the following in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-05-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998531#comment-15998531
 ] 

Thomas Graves edited comment on SPARK-18971 at 5/5/17 4:31 PM:
---

[~zsxwing]have you seen any issues with the new netty version?  We have hit 
this same issue.


was (Author: tgraves):
[~zsxwing]have you seen any issues with the new netty version?  We have hit a 
similar issue?

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see a stack trace similar to the following in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-20613:
-

Assignee: Jarrett Meyer

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Assignee: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998474#comment-15998474
 ] 

Felix Cheung commented on SPARK-20613:
--

[~shivaram] could you add jarretmeyer to the contributor list in JIRA so I can 
assign this bug to him?

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20613.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
  2.2.0
  2.1.2
Target Version/s: 2.1.2, 2.2.0, 2.3.0

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998472#comment-15998472
 ] 

Wenchen Fan commented on SPARK-20569:
-

Yeah, this is a bug; I'm working on a fix.

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Priority: Trivial
>
> >select  Nvl(null,'1',3);
> >3
> The function of "Nvl" has Only two  input parameters,so, when input three 
> parameters, i think it should notice that:"Error in query: Invalid number of 
> arguments for function nvl".
> Such as "nvl2", "nullIf","IfNull",these have a similar problem



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20581) Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE

2017-05-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998465#comment-15998465
 ] 

Wenchen Fan commented on SPARK-20581:
-

[~smilegator] do you remember which PR fixed it? We can consider backporting it.

> Using AVG or SUM on a INT/BIGINT column with fraction operator will yield 
> BIGINT instead of DOUBLE
> --
>
> Key: SPARK-20581
> URL: https://issues.apache.org/jira/browse/SPARK-20581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Dominic Ricard
>
> We stumbled on this multiple times and every time we are baffled by the 
> behavior of AVG and SUM.
> Given the following SQL (Executed through Thrift):
> {noformat}
> SELECT SUM(col/2) FROM
> (SELECT 3 as `col`) t
> {noformat}
> The result will be "1", when the expected and accurate result is 1.5
> Here's the explain plan:
> {noformat}
> == Physical Plan ==   
> TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 
> 2.0) as bigint)),mode=Final,isDistinct=false)], output=[_c0#1519344L])  
> +- TungstenExchange SinglePartition, None 
>+- TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as 
> double) / 2.0) as bigint)),mode=Partial,isDistinct=false)], 
> output=[sum#1519347L])  
>   +- Project [3 AS col#1519342]   
>  +- Scan OneRowRelation[] 
> {noformat}
> Why the extra cast to BIGINT?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Jon McLean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998460#comment-15998460
 ] 

Jon McLean commented on SPARK-20615:


Thank you.  I will submit a patch with tests.

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfBoundsException when the vector size is greater than zero and no 
> values are defined.  The toString() representation of such a vector is " 
> (10,[],[])".  This is because the argmax function tries to get the value 
> at indices(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998459#comment-15998459
 ] 

Sean Owen commented on SPARK-20615:
---

Agree, I think you just want to return 0 if numActives == 0 early in the method.
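
A minimal sketch of that guard, written as a standalone helper under the same sparse representation rather than the actual SparseVector.argmax patch (see the linked PR for the real change):
{code}
// size/indices/values mirror the sparse representation "(10,[],[])" from the report
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  require(size > 0, "argmax is undefined for an empty vector")
  if (indices.isEmpty) {
    0  // no active entries: every element is implicitly 0.0, so index 0 is a valid argmax
  } else {
    // simplified argmax over the active entries; the real implementation must also account
    // for implicit zeros when the largest active value is negative
    indices(values.indices.maxBy(i => values(i)))
  }
}

sparseArgmax(10, Array.empty[Int], Array.empty[Double])  // returns 0 instead of throwing
{code}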

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Priority: Minor
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfBoundsException when the vector size is greater than zero and no 
> values are defined.  The toString() representation of such a vector is " 
> (10,[],[])".  This is because the argmax function tries to get the value 
> at indices(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-05 Thread Jon McLean (JIRA)
Jon McLean created SPARK-20615:
--

 Summary: SparseVector.argmax throws IndexOutOfBoundsException when 
the sparse vector has a size greater than zero but no elements defined.
 Key: SPARK-20615
 URL: https://issues.apache.org/jira/browse/SPARK-20615
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Jon McLean
Priority: Minor


org.apache.spark.ml.linalg.SparseVector.argmax throws an 
IndexOutOfBoundsException when the vector size is greater than zero and no 
values are defined.  The toString() representation of such a vector is " 
(10,[],[])".  This is because the argmax function tries to get the value at 
indices(0) without checking the size of the array.

Code inspection reveals that the mllib version of SparseVector should have the 
same issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20495) Add StorageLevel to cacheTable API

2017-05-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998442#comment-15998442
 ] 

Wenchen Fan commented on SPARK-20495:
-

we usually don't backport new API changes, but this one is very small and might 
be ok, cc [~redlighter]

> Add StorageLevel to cacheTable API 
> ---
>
> Key: SPARK-20495
> URL: https://issues.apache.org/jira/browse/SPARK-20495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
> Fix For: 2.3.0
>
>
> Currently, cacheTable API always uses the default MEMORY_AND_DISK storage 
> level. We can add a new cacheTable API with the extra parameter StorageLevel. 
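
A hedged usage sketch of the new overload, assuming it takes the storage level as a second argument as the summary describes ("events" is a placeholder table name, not from the JIRA):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cacheTable-demo").getOrCreate()
spark.range(100).createOrReplaceTempView("events")

// pick an explicit storage level instead of the implicit MEMORY_AND_DISK default
spark.catalog.cacheTable("events", StorageLevel.MEMORY_ONLY)
{code}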



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20495) Add StorageLevel to cacheTable API

2017-05-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998442#comment-15998442
 ] 

Wenchen Fan edited comment on SPARK-20495 at 5/5/17 3:00 PM:
-

we usually don't backport new API changes, but this one is very small and might 
be ok, cc [~smilegator]


was (Author: cloud_fan):
we usually don't backport new API changes, but this one is very small and might 
be ok, cc [~redlighter]

> Add StorageLevel to cacheTable API 
> ---
>
> Key: SPARK-20495
> URL: https://issues.apache.org/jira/browse/SPARK-20495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
> Fix For: 2.3.0
>
>
> Currently, cacheTable API always uses the default MEMORY_AND_DISK storage 
> level. We can add a new cacheTable API with the extra parameter StorageLevel. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20495) Add StorageLevel to cacheTable API

2017-05-05 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998435#comment-15998435
 ] 

PJ Fanning commented on SPARK-20495:


Thanks everyone for working on this change. Is it too late to consider this for 
v2.2.0 or even v2.2.1?

> Add StorageLevel to cacheTable API 
> ---
>
> Key: SPARK-20495
> URL: https://issues.apache.org/jira/browse/SPARK-20495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
> Fix For: 2.3.0
>
>
> Currently, cacheTable API always uses the default MEMORY_AND_DISK storage 
> level. We can add a new cacheTable API with the extra parameter StorageLevel. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20495) Add StorageLevel to cacheTable API

2017-05-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20495.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17802
[https://github.com/apache/spark/pull/17802]

> Add StorageLevel to cacheTable API 
> ---
>
> Key: SPARK-20495
> URL: https://issues.apache.org/jira/browse/SPARK-20495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
> Fix For: 2.3.0
>
>
> Currently, cacheTable API always uses the default MEMORY_AND_DISK storage 
> level. We can add a new cacheTable API with the extra parameter StorageLevel. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998402#comment-15998402
 ] 

Apache Spark commented on SPARK-20612:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17874

> Unresolvable attribute in Filter won't throw analysis exception
> ---
>
> Key: SPARK-20612
> URL: https://issues.apache.org/jira/browse/SPARK-20612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We have a rule in the Analyzer that adds missing attributes in a Filter into its 
> child plan. It makes the following code work:
> {code}
> val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
> df.select("y").where("x=1")
> {code}
> It should throw an analysis exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20614:


Assignee: (was: Apache Spark)

> Use the same log4j configuration with Jenkins in AppVeyor
> -
>
> Key: SPARK-20614
> URL: https://issues.apache.org/jira/browse/SPARK-20614
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Currently, the AppVeyor console is flooded with logs. This has been fine 
> because we can download all the logs; however, given my observations so far, 
> the logs are truncated when there are too many.
> For example, see  
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
> Even after the log is downloaded, it looks truncated as below:
> {code}
> [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in 
> stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 
> (TID 9213)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 
> 601.0 (TID 9212). 2473 bytes result sent to driver
> {code}
> It would probably be better to use the same log4j configuration that we use 
> for Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20614:


Assignee: Apache Spark

> Use the same log4j configuration with Jenkins in AppVeyor
> -
>
> Key: SPARK-20614
> URL: https://issues.apache.org/jira/browse/SPARK-20614
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently, the AppVeyor console is flooded with logs. This has been fine 
> because we can download all the logs; however, given my observations so far, 
> the logs are truncated when there are too many.
> For example, see  
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
> Even after the log is downloaded, it looks truncated as below:
> {code}
> [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in 
> stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 
> (TID 9213)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 
> 601.0 (TID 9212). 2473 bytes result sent to driver
> {code}
> It would probably be better to use the same log4j configuration that we use 
> for Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998359#comment-15998359
 ] 

Apache Spark commented on SPARK-20614:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17873

> Use the same log4j configuration with Jenkins in AppVeyor
> -
>
> Key: SPARK-20614
> URL: https://issues.apache.org/jira/browse/SPARK-20614
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Currently, the AppVeyor console is flooded with logs. This has been fine 
> because we can download all the logs; however, given my observations so far, 
> the logs are truncated when there are too many.
> For example, see  
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
> Even after the log is downloaded, it looks truncated as below:
> {code}
> [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in 
> stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 
> (TID 9213)
> [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 
> 601.0 (TID 9212). 2473 bytes result sent to driver
> {code}
> It would probably be better to use the same log4j configuration that we use 
> for Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20614) Use the same log4j configuration with Jenkins in AppVeyor

2017-05-05 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20614:


 Summary: Use the same log4j configuration with Jenkins in AppVeyor
 Key: SPARK-20614
 URL: https://issues.apache.org/jira/browse/SPARK-20614
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon


Currently, the AppVeyor console is flooded with logs. This has been fine because 
we can download all the logs; however, given my observations so far, the logs 
are truncated when there are too many.

For example, see  
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master

Even after the log is downloaded, it looks truncated as below:

{code}
[00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 
601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
[00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 
(TID 9213)
[00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 
(TID 9212). 2473 bytes result sent to driver
{code}

It would probably be better to use the same log4j configuration that we use 
for Jenkins.
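For illustration only, the general shape of such a configuration (a sketch under the assumptions of log4j 1.x properties syntax and a file target of target/unit-tests.log; it is not the actual file used on Jenkins) routes test output to a file and keeps the console quiet:

{code}
# Sketch only -- not the exact Jenkins configuration.
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n

# Quiet particularly chatty third-party loggers.
log4j.logger.org.spark_project.jetty=WARN
{code}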



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

2017-05-05 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998228#comment-15998228
 ] 

Rick Moritz commented on SPARK-20489:
-

If someone could try and replicate my observations, I think that would be a 
great bit of help - the above code should run as-is.

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -
>
> Key: SPARK-20489
> URL: https://issues.apache.org/jira/browse/SPARK-20489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>Reporter: Rick Moritz
>Priority: Critical
>
> Running the following code (in Zeppelin or spark-shell), I get different 
> results depending on whether I am using local[*] mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
> sc.parallelize(size)
> .map(Row(_)),
> StructType(Array(StructField("id", IntegerType, nullable=false))
> )
> )
> .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
> 
> val rddList = counter.map(
> count => sampleText
> .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS"))
> .drop(col("loadDTS"))
> .withColumnRenamed("loadDTS2","loadDTS")
> .coalesce(4)
> .rdd
> )
> val resultText = spark.createDataFrame(
> spark.sparkContext.union(rddList),
> sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
> max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get 
> some results (depending on the size of counter), none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.
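On the SimpleDateFormat hypothesis: SimpleDateFormat is not thread-safe, so one instance shared across executor threads can silently produce corrupted parses. A minimal sketch of the usual defensive pattern (illustrative only, with a made-up object name; this is not the fix applied in Spark):

{code}
import java.text.SimpleDateFormat

object ThreadSafeTimestampParser {
  // One formatter per thread, because SimpleDateFormat carries mutable state.
  private val fmt = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat =
      new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
  }

  def parseMillis(s: String): Long = fmt.get().parse(s).getTime
}
{code}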



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998197#comment-15998197
 ] 

Yuechen Chen commented on SPARK-20608:
--

[~ste...@apache.org] Your worry is reasonable. In our tests, there are two 
possible exceptions when 
yarn.spark.access.namenodes=hdfs://activeNamenode,hdfs://standbyNamenode:
1) Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby
2) Caused by: org.apache.hadoop.ipc.StandbyException: Operation category WRITE 
is not supported in state standby
Maybe RemoteException should be caught in a better way.
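As a concrete illustration, a helper along the lines of the getActiveNameNode(...) mentioned in the description could probe each configured namenode and keep the first one that answers. This is a hypothetical sketch using the Hadoop FileSystem client API, not code from Spark or from this patch:

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object NameNodeProbe {
  // A standby namenode rejects the RPC with a (Remote-wrapped) StandbyException,
  // which reaches the client as an IOException, so that namenode is skipped.
  def getActiveNameNode(nameNodes: Seq[String], conf: Configuration): String =
    nameNodes.find { nn =>
      try {
        FileSystem.get(new URI(nn), conf).getFileStatus(new Path("/"))
        true
      } catch {
        case _: java.io.IOException => false
      }
    }.getOrElse(sys.error("No active namenode reachable"))
}
{code}

Pointing the application at the HA nameservice URI, so the HDFS client performs its own failover, would avoid the probe entirely and is arguably the cleaner route.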

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in yarn.spark.access.namenodes causes no 
> harm, and it would let my Spark application survive a failover of the Hadoop 
> namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998189#comment-15998189
 ] 

Apache Spark commented on SPARK-20608:
--

User 'morenn520' has created a pull request for this issue:
https://github.com/apache/spark/pull/17872

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in yarn.spark.access.namenodes causes no 
> harm, and it would let my Spark application survive a failover of the Hadoop 
> namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20613:
--
Priority: Major  (was: Blocker)

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Jarrett Meyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarrett Meyer updated SPARK-20613:
--
Description: 
This is a new issue in version 2.1.1. This problem was not present in 2.1.0.

In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
immediately, producing something like

{code}
RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
   ab  c
{code}

The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. 
This creates a space at position {{c}}, which is invalid syntax.

  was:
In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
immediately, producing something like

{code}
RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
   ab  c
{code}

The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. 
This creates a space at position {{c}}, which is invalid syntax.


> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Priority: Blocker
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20613:


Assignee: (was: Apache Spark)

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Priority: Blocker
>
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20613:


Assignee: Apache Spark

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Assignee: Apache Spark
>Priority: Blocker
>
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998160#comment-15998160
 ] 

Apache Spark commented on SPARK-20613:
--

User 'jarrettmeyer' has created a pull request for this issue:
https://github.com/apache/spark/pull/17861

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Priority: Blocker
>
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Jarrett Meyer (JIRA)
Jarrett Meyer created SPARK-20613:
-

 Summary: Double quotes in Windows batch script
 Key: SPARK-20613
 URL: https://issues.apache.org/jira/browse/SPARK-20613
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 2.1.1
Reporter: Jarrett Meyer
Priority: Blocker


In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
immediately, producing something like

{code}
RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
   ab  c
{code}

The quote above {{a}} opens the quote. The quote above {{b}} closes the quote. 
This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998150#comment-15998150
 ] 

Steve Loughran commented on SPARK-20608:


Probably good to pull in someone who understands HDFS HA; I nominate 
[~liuml07]. 

My main worry is that RemoteException could be a symptom of something more 
serious than the node being in standby, but I don't know enough about NN HA for 
my opinions to be trusted.



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in yarn.spark.access.namenodes causes no 
> harm, and it would let my Spark application survive a failover of the Hadoop 
> namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20546:
-

Assignee: Jessie Yu

> spark-class gets syntax error in posix mode
> ---
>
> Key: SPARK-20546
> URL: https://issues.apache.org/jira/browse/SPARK-20546
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.2
>Reporter: Jessie Yu
>Assignee: Jessie Yu
>Priority: Minor
> Fix For: 2.1.2, 2.2.1
>
>
> spark-class gets the following error when running in posix mode:
> {code}
> spark-class: line 78: syntax error near unexpected token `<'
> spark-class: line 78: `done < <(build_command "$@")'
> {code}
> \\
> It appears to be complaining about the process substitution: 
> {code}
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <(build_command "$@")
> {code}
> \\
> This can be reproduced by first turning on allexport then posix mode:
> {code}set -a -o posix {code}
> then run something like spark-shell which calls spark-class.
> \\
> The simplest fix is probably to always turn off posix mode in spark-class 
> before the while loop.
> \\
> This was previously reported in 
> [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417], which was 
> closed as "cannot reproduce". 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20546.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.1

Issue resolved by pull request 17852
[https://github.com/apache/spark/pull/17852]

> spark-class gets syntax error in posix mode
> ---
>
> Key: SPARK-20546
> URL: https://issues.apache.org/jira/browse/SPARK-20546
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.2
>Reporter: Jessie Yu
>Priority: Minor
> Fix For: 2.2.1, 2.1.2
>
>
> spark-class gets the following error when running in posix mode:
> {code}
> spark-class: line 78: syntax error near unexpected token `<'
> spark-class: line 78: `done < <(build_command "$@")'
> {code}
> \\
> It appears to be complaining about the process substitution: 
> {code}
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <(build_command "$@")
> {code}
> \\
> This can be reproduced by first turning on allexport then posix mode:
> {code}set -a -o posix {code}
> then run something like spark-shell which calls spark-class.
> \\
> The simplest fix is probably to always turn off posix mode in spark-class 
> before the while loop.
> \\
> This was previously reported in 
> [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417], which was 
> closed as "cannot reproduce". 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998050#comment-15998050
 ] 

Sean Owen commented on SPARK-20611:
---

No, there's not necessarily any problem in Spark. The Logging trait changed 
over versions of Spark -- not in CDH -- and if you don't match the versions 
correctly, this internal API may not be compatible across releases, because 
it's not an external API. For example you generally use Spark 2 in CDH 5.10 but 
you are targeting 1.6. CDH doesn't support the Kinesis connector, though it may 
happen to work. This is in any event not an issue for Spark.
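To illustrate the version-alignment point, here is a hypothetical build sketch in sbt (not the reporter's actual Maven setup): every Spark artifact, including the Kinesis connector, is resolved at the exact version the cluster runs, with cluster-provided modules marked provided so the runtime classes win.

{code}
// build.sbt sketch -- illustrative only; pin sparkVersion to what the cluster actually runs.
val sparkVersion = "1.6.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"             % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion,
  "com.amazonaws"    %  "amazon-kinesis-client"       % "1.6.1"
)
{code}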

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
yarn.spark.access.namenodes only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode. 
However, if yarn.spark.access.namenodes includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in yarn.spark.access.namenodes causes no 
harm, and it would let my Spark application survive a failover of the Hadoop 
namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
{yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode. 
However, if {yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in {yarn.spark.access.namenodes} causes no 
harm, and it would let my Spark application survive a failover of the Hadoop 
namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in yarn.spark.access.namenodes causes no 
> harm, and it would let my Spark application survive a failover of the Hadoop 
> namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20612:


Assignee: Apache Spark

> Unresolvable attribute in Filter won't throw analysis exception
> ---
>
> Key: SPARK-20612
> URL: https://issues.apache.org/jira/browse/SPARK-20612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We have a rule in the Analyzer that adds missing attributes in a Filter into its 
> child plan. It makes the following code work:
> {code}
> val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
> df.select("y").where("x=1")
> {code}
> It should throw an analysis exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20612:


Assignee: (was: Apache Spark)

> Unresolvable attribute in Filter won't throw analysis exception
> ---
>
> Key: SPARK-20612
> URL: https://issues.apache.org/jira/browse/SPARK-20612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We have a rule in the Analyzer that adds missing attributes in a Filter into its 
> child plan. It makes the following code work:
> {code}
> val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
> df.select("y").where("x=1")
> {code}
> It should throw an analysis exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997993#comment-15997993
 ] 

Apache Spark commented on SPARK-20612:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17871

> Unresolvable attribute in Filter won't throw analysis exception
> ---
>
> Key: SPARK-20612
> URL: https://issues.apache.org/jira/browse/SPARK-20612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We have a rule in the Analyzer that adds missing attributes in a Filter into its 
> child plan. It makes the following code work:
> {code}
> val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
> df.select("y").where("x=1")
> {code}
> It should throw an analysis exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987
 ] 

sumit edited comment on SPARK-20611 at 5/5/17 9:32 AM:
---

Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as 
part of the Spark external modules 
(https://github.com/apache/spark/tree/master/external), it would get fixed here. 
I am running against the same version of Spark, which is 1.6. The issue is that 
the CDH distribution has modified the internal Spark Logging class.

please see - https://issues.apache.org/jira/browse/LEGAL-198


was (Author: sumitkumarkarn):
Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as 
part of the Spark external modules 
(https://github.com/apache/spark/tree/master/external), it would get fixed here. 
I am running against the same version of Spark, which is 1.6. The issue is that 
the CDH distribution has modified the internal Spark Logging class.

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987
 ] 

sumit edited comment on SPARK-20611 at 5/5/17 9:29 AM:
---

Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as 
part of the Spark external modules 
(https://github.com/apache/spark/tree/master/external), it would get fixed here. 
I am running against the same version of Spark, which is 1.6. The issue is that 
the CDH distribution has modified the internal Spark Logging class.


was (Author: sumitkumarkarn):
Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as 
part of the Spark external modules 
(https://github.com/apache/spark/tree/master/external), it would get fixed here. 

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997987#comment-15997987
 ] 

sumit commented on SPARK-20611:
---

Hi [~sowen], does this mean I should log a ticket with CDH? I thought that, as 
part of the Spark external modules 
(https://github.com/apache/spark/tree/master/external), it would get fixed here. 

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20612) Unresolvable attribute in Filter won't throw analysis exception

2017-05-05 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-20612:
---

 Summary: Unresolvable attribute in Filter won't throw analysis 
exception
 Key: SPARK-20612
 URL: https://issues.apache.org/jira/browse/SPARK-20612
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh


We have a rule in the Analyzer that adds missing attributes in a Filter into its 
child plan. It makes the following code work:

{code}
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
df.select("y").where("x=1")
{code}

It should throw an analysis exception.
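For what it's worth, the equivalent query with the filter applied before the projection resolves cleanly, which is presumably what users intend (a small Scala sketch; spark is assumed to be an active SparkSession with its implicits imported):

{code}
// Filtering before the projection keeps "x" in scope, so no analysis fix-up is needed.
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
df.where("x = 1").select("y").show()

// With the order reversed -- df.select("y").where("x = 1") -- the filter
// references a column the projection has already dropped, and (per this issue)
// that should fail analysis rather than be silently repaired.
{code}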



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20611.
---
Resolution: Not A Problem

If a question is specific to CDH, it doesn't belong here, but rather at 
Cloudera.
No, it doesn't actually fix anything to duplicate the Logging trait.
We do not use patches. You should read http://spark.apache.org/contributing.html
The problem is a version mismatch. I don't think you have built against the same 
version of Spark that you run on.

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-20611.
-

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20608:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> {yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if {yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in {yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
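
For context, getActiveNameNode(...) in the example above is the reporter's own helper, not a Spark or Hadoop API. A minimal sketch of what such a helper could look like, assuming the Hadoop FileSystem client is on the classpath (the probe logic below is an illustration, not part of the proposed change):

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: probe each configured namenode and return the first one
// that serves a metadata call; a standby rejects the call
// (org.apache.hadoop.ipc.StandbyException) and is skipped.
def getActiveNameNode(nameNodes: Seq[String], conf: Configuration): String =
  nameNodes.find { nn =>
    try {
      FileSystem.get(new URI(nn), conf).getFileStatus(new Path("/"))
      true
    } catch {
      case _: Exception => false // standby or unreachable namenode
    }
  }.getOrElse(throw new IllegalStateException("No active namenode found"))

// Usage, mirroring the example in the description:
//   val active = getActiveNameNode(Seq("hdfs://namenode01", "hdfs://namenode02"), new Configuration())
//   dataframe.write.parquet(active + hdfsPath)
{code}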



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997948#comment-15997948
 ] 

Sean Owen commented on SPARK-20608:
---

CC [~vanzin] [~ste...@apache.org] 

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> {yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if {yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in {yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration

2017-05-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997946#comment-15997946
 ] 

Sean Owen commented on SPARK-20472:
---

JVM config matters. How do you change the driver heap size in client mode after 
startup?
What are the semantics of changing a batch size at runtime? A cache size?
It raises a lot of questions, so no, this is not generally possible.
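
To make the driver-heap example concrete: a minimal sketch (assuming a driver running in client mode) of why JVM-level settings cannot be picked up from application code after startup:

{code}
import org.apache.spark.sql.SparkSession

// In client mode the driver JVM is already running when this code executes,
// so spark.driver.memory set here has no effect; it has to be supplied at
// launch time (e.g. --driver-memory on spark-submit or spark-defaults.conf).
val spark = SparkSession.builder()
  .appName("driver-memory-set-too-late")
  .config("spark.driver.memory", "4g") // ignored by an already-started JVM
  .getOrCreate()
{code}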

> Support for Dynamic Configuration
> -
>
> Key: SPARK-20472
> URL: https://issues.apache.org/jira/browse/SPARK-20472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: Shahbaz Hussain
>
> Currently, Spark configuration cannot be changed dynamically.
> It requires the Spark job to be killed and started again for a new 
> configuration to take effect.
> This issue is to enhance Spark so that configuration changes can be applied 
> dynamically without requiring an application restart.
> Ex: If the batch interval of a streaming job is 20 seconds and the user wants 
> to reduce it to 5 seconds, it currently requires a re-submit of the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
{yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) fetches HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode.
However, if {yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in {yarn.spark.access.namenodes} would 
cause no harm, and my Spark application would then survive a Hadoop namenode 
failover.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) fetches HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
cause no harm, and my Spark application would then survive a Hadoop namenode 
failover.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> {yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if {yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in {yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20608:


Assignee: Apache Spark

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Assignee: Apache Spark
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if ${yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) fetches HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
cause no harm, and my Spark application would then survive a Hadoop namenode 
failover.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) fetches HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at 
least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby namenode cannot be 
accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
cause no harm, and my Spark application would then survive a Hadoop namenode 
failover.

HA Examples:
spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if ${yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20608:


Assignee: (was: Apache Spark)

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if ${yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997900#comment-15997900
 ] 

Apache Spark commented on SPARK-20608:
--

User 'morenn520' has created a pull request for this issue:
https://github.com/apache/spark/pull/17870

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) fetches HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode.
> However, if ${yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in ${yarn.spark.access.namenodes} would 
> cause no harm, and my Spark application would then survive a Hadoop namenode 
> failover.
> HA Examples:
> spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20611) Spark kinesis connector doesnt work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sumit updated SPARK-20611:
--
Summary: Spark kinesis connector doesnt work with  cloudera distribution  
(was: Spark kinesis connector doesn work with  cloudera distribution)

> Spark kinesis connector doesnt work with  cloudera distribution
> ---
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997889#comment-15997889
 ] 

sumit commented on SPARK-20611:
---

Please evaluate and review the patch file. If it looks good, I would like to 
submit a PR for it. Thanks.



> Spark kinesis connector doesn work with  cloudera distribution
> --
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution

2017-05-05 Thread sumit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sumit updated SPARK-20611:
--
Attachment: spark-kcl.patch

The attached patch does essentially the same thing that was done in the past for 
the Cassandra connector, see the external link on this ticket, i.e. 
https://datastax-oss.atlassian.net/browse/SPARKC-460
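
For readers without access to the attachment: one common way around this kind of binary incompatibility is to stop calling the distribution's org.apache.spark.Logging and compile a small logging trait into the connector itself. Below is a minimal sketch of such a shim, assuming slf4j on the classpath; whether the attached patch (or the SPARKC-460 fix) takes exactly this shape is an assumption.

{code}
package org.apache.spark.streaming.kinesis

import org.slf4j.{Logger, LoggerFactory}

// Hypothetical shim: a self-contained logging trait, so the Kinesis receiver
// classes no longer depend on a (possibly binary-incompatible) version of
// org.apache.spark.Logging and avoid java.lang.AbstractMethodError at runtime.
private[kinesis] trait KinesisLogging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)
}
{code}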


> Spark kinesis connector doesn work with  cloudera distribution
> --
>
> Key: SPARK-20611
> URL: https://issues.apache.org/jira/browse/SPARK-20611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: sumit
>  Labels: cloudera
> Attachments: spark-kcl.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Facing below exception on CDH5.10
> 17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
> (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
> at org.apache.spark.Logging$class.log(Logging.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
> at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
> at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
> at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> below is my POM file
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>amazon-kinesis-client</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
>   <version>1.6.0</version>
> </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20611) Spark kinesis connector doesn work with cloudera distribution

2017-05-05 Thread sumit (JIRA)
sumit created SPARK-20611:
-

 Summary: Spark kinesis connector doesn work with  cloudera 
distribution
 Key: SPARK-20611
 URL: https://issues.apache.org/jira/browse/SPARK-20611
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: sumit


Facing below exception on CDH5.10

17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 
(TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
at 
org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

below is my POM file

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.0</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>amazon-kinesis-client</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
  <version>1.6.0</version>
</dependency>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20610) Support a function get DataFrame/DataSet from Transformer

2017-05-05 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet closed SPARK-20610.
-
Resolution: Won't Fix

> Support a function get DataFrame/DataSet from Transformer
> -
>
> Key: SPARK-20610
> URL: https://issues.apache.org/jira/browse/SPARK-20610
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2, 2.1.0
>Reporter: darion yaphet
>
> We use stages to build our machine learning pipeline. A Transformer transforms 
> an input dataset into an output dataset (a DataFrame). While developing the 
> pipeline we sometimes want to check a stage's resulting DataFrame, but it is 
> currently difficult to write such a test. If Spark ML stages exposed an 
> interface for inspecting the DataFrame produced by each stage, we could use it 
> in tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20610) Support a function get DataFrame/DataSet from Transformer

2017-05-05 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20610:
-

 Summary: Support a function get DataFrame/DataSet from Transformer
 Key: SPARK-20610
 URL: https://issues.apache.org/jira/browse/SPARK-20610
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.1.0, 2.0.2
Reporter: darion yaphet


We use stages to build our machine learning pipeline. A Transformer transforms 
an input dataset into an output dataset (a DataFrame). While developing the 
pipeline we sometimes want to check a stage's resulting DataFrame, but it is 
currently difficult to write such a test. If Spark ML stages exposed an 
interface for inspecting the DataFrame produced by each stage, we could use it 
in tests.
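
Until such an interface exists, intermediate DataFrames can be inspected from test code by applying a fitted pipeline stage by stage; a minimal sketch (the helper name below is made up for illustration):

{code}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// Hypothetical test helper: apply each fitted stage in order and collect the
// DataFrame produced after every stage, so individual stages can be asserted on.
def intermediateOutputs(model: PipelineModel, input: DataFrame): Seq[DataFrame] =
  model.stages.scanLeft(input)((df, stage) => stage.transform(df)).drop(1).toSeq

// Usage in a test (pipelineModel and inputDF are assumed to exist):
//   intermediateOutputs(pipelineModel, inputDF).foreach(_.show(5))
{code}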



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration

2017-05-05 Thread Shahbaz Hussain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997874#comment-15997874
 ] 

Shahbaz Hussain commented on SPARK-20472:
-

Yes, the idea is to have a way to change and persist configuration in memory, 
e.g. batch interval, SQL shuffle partitions, etc.; these are primarily 
Spark-specific settings.
JVM configuration is global and cannot be changed; this request is not about 
dynamic JVM configuration but about Spark application-specific configuration.
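
For what it's worth, a subset of the Spark-specific settings mentioned (the SQL ones) can already be changed per session at runtime; a minimal sketch, assuming an existing SparkSession named spark:

{code}
// SQL-level settings are runtime-mutable per session and apply to later queries.
spark.conf.set("spark.sql.shuffle.partitions", "64")
println(spark.conf.get("spark.sql.shuffle.partitions")) // 64

// Settings consumed at launch time (streaming batch interval, driver/executor
// sizing, ...) are not covered by this mechanism and still need a re-submit.
{code}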

> Support for Dynamic Configuration
> -
>
> Key: SPARK-20472
> URL: https://issues.apache.org/jira/browse/SPARK-20472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: Shahbaz Hussain
>
> Currently, Spark configuration cannot be changed dynamically.
> It requires the Spark job to be killed and started again for a new 
> configuration to take effect.
> This issue is to enhance Spark so that configuration changes can be applied 
> dynamically without requiring an application restart.
> Ex: If the batch interval of a streaming job is 20 seconds and the user wants 
> to reduce it to 5 seconds, it currently requires a re-submit of the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics

2017-05-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-20545.
---
Resolution: Cannot Reproduce

> union set operator should default to DISTINCT and not ALL semantics
> ---
>
> Key: SPARK-20545
> URL: https://issues.apache.org/jira/browse/SPARK-20545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>
> A set operation (i.e. UNION) over two queries that produce identical row 
> values should return the distinct set of rows, not all rows.
> ISO SQL set operation semantics default to DISTINCT.
> The Spark implementation defaults to ALL.
> While Spark allows the DISTINCT keyword, and some might assume ALL is faster, 
> the semantically wrong result set is produced per the standard (and commercial 
> SQL systems including Oracle, DB2, Teradata, SQL Server, etc.)
> select tsint.csint from cert.tsint 
> union 
> select tint.cint from cert.tint 
> csint
> 
> -1
> 0
> 1
> 10
> 
> -1
> 0
> 1
> 10
> vs
> select tsint.csint from cert.tsint union distinct select tint.cint from 
> cert.tint 
> csint
> -1
> 
> 1
> 10
> 0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics

2017-05-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997845#comment-15997845
 ] 

Xiao Li commented on SPARK-20545:
-

Please reopen it if you still hit this issue. Thanks!

> union set operator should default to DISTINCT and not ALL semantics
> ---
>
> Key: SPARK-20545
> URL: https://issues.apache.org/jira/browse/SPARK-20545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>
> A set operation (i.e. UNION) over two queries that produce identical row 
> values should return the distinct set of rows, not all rows.
> ISO SQL set operation semantics default to DISTINCT.
> The Spark implementation defaults to ALL.
> While Spark allows the DISTINCT keyword, and some might assume ALL is faster, 
> the semantically wrong result set is produced per the standard (and commercial 
> SQL systems including Oracle, DB2, Teradata, SQL Server, etc.)
> select tsint.csint from cert.tsint 
> union 
> select tint.cint from cert.tint 
> csint
> 
> -1
> 0
> 1
> 10
> 
> -1
> 0
> 1
> 10
> vs
> select tsint.csint from cert.tsint union distinct select tint.cint from 
> cert.tint 
> csint
> -1
> 
> 1
> 10
> 0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics

2017-05-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997843#comment-15997843
 ] 

Xiao Li commented on SPARK-20545:
-

You can try 
{noformat}
select 3 as `col` union select 3 as `col` 
{noformat}

It outputs 3. 

In Spark SQL, if neither ALL nor DISTINCT is used, DISTINCT behavior is the 
default.
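
A quick way to double-check both behaviors from a Spark shell (assuming a SparkSession named spark):

{code}
// UNION without ALL/DISTINCT deduplicates, matching the ISO SQL default.
spark.sql("SELECT 3 AS col UNION SELECT 3 AS col").count()     // 1

// UNION ALL keeps duplicates.
spark.sql("SELECT 3 AS col UNION ALL SELECT 3 AS col").count() // 2
{code}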

> union set operator should default to DISTINCT and not ALL semantics
> ---
>
> Key: SPARK-20545
> URL: https://issues.apache.org/jira/browse/SPARK-20545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>
> A set operation (i.e. UNION) over two queries that produce identical row 
> values should return the distinct set of rows, not all rows.
> ISO SQL set operation semantics default to DISTINCT.
> The Spark implementation defaults to ALL.
> While Spark allows the DISTINCT keyword, and some might assume ALL is faster, 
> the semantically wrong result set is produced per the standard (and commercial 
> SQL systems including Oracle, DB2, Teradata, SQL Server, etc.)
> select tsint.csint from cert.tsint 
> union 
> select tint.cint from cert.tint 
> csint
> 
> -1
> 0
> 1
> 10
> 
> -1
> 0
> 1
> 10
> vs
> select tsint.csint from cert.tsint union distinct select tint.cint from 
> cert.tint 
> csint
> -1
> 
> 1
> 10
> 0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


