[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-43201: - Description: The Spark from_avro function does not allow the schema parameter to use a dataframe column; it takes only a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally. Here is what I would expect, like the from_json function: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} Code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // +-------------+ // | parsedData| // +-------------+ // |[apple1, 1.0]| // |[apple2, 2.0]| // +-------------+ {code} was: The Spark from_avro function does not allow the schema parameter to use a dataframe column; it takes only a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} Code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // +-------------+ // | parsedData| // +-------------+ // |[apple1, 1.0]| // |[apple2, 2.0]| // +-------------+ {code} > Inconsistency between from_avro and from_json function > -- > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Philip Adetiloye >Priority: Major > > The Spark from_avro function does not allow the schema parameter to use a > dataframe column; it takes only a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schemas, since only one schema string can be passed externally. 
> > Here is what I would expect, like the from_json function: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > Code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) > parsed.show() > // Output: > // +-------------+ > // | parsedData| > // +-------------+ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // +-------------+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-43201: - Description: Spark from_avro function does not allow schema parameter to use dataframe column but takes only a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} was: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} > Inconsistency between from_avro and from_json function > -- > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Philip Adetiloye >Priority: Major > > Spark from_avro function does not allow schema parameter to use dataframe > column but takes only a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schema since only one schema string could be pass externally. 
> > Here is what I would expect: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) > parsed.show() > // Output: > // ++ > // | parsedData| > // ++ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // ++ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-43201: - Description: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} was: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} > Inconsistency between from_avro and from_json function > -- > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Philip Adetiloye >Priority: Major > > Spark from_avro function does not allow schema to use dataframe column but > takes a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schema since only one schema string could be pass externally. 
> > Here is what I would expect: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", > $"schema").as("parsedData"))parsed.show() > // Output: > // ++ > // | parsedData| > // ++ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // ++ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-43201: - Description: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} was: Spark from_avro function does not allow schema to use dataframe column but takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schema since only one schema string could be pass externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show() // Output: // ++ // | parsedData| // ++ // |[apple1, 1.0]| // |[apple2, 2.0]| // ++ {code} > Inconsistency between from_avro and from_json function > -- > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Philip Adetiloye >Priority: Major > > Spark from_avro function does not allow schema to use dataframe column but > takes a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schema since only one schema string could be pass externally. 
> > Here is what I would expect: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) > parsed.show() > // Output: > // ++ > // | parsedData| > // ++ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // ++ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43201) Inconsistency between from_avro and from_json function
Philip Adetiloye created SPARK-43201: Summary: Inconsistency between from_avro and from_json function Key: SPARK-43201 URL: https://issues.apache.org/jira/browse/SPARK-43201 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Philip Adetiloye The Spark from_avro function does not allow the schema to use a dataframe column; it takes a String schema: {code:java} def from_avro(col: Column, jsonFormatSchema: String): Column {code} This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally. Here is what I would expect: {code:java} def from_avro(col: Column, jsonFormatSchema: Column): Column {code} Code example: {code:java} import org.apache.spark.sql.functions.from_avro val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" val df = Seq( (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) ).toDF("binaryData", "schema") val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) parsed.show() // Output: // +-------------+ // | parsedData| // +-------------+ // |[apple1, 1.0]| // |[apple2, 2.0]| // +-------------+ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
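For comparison, the consistency the description asks for already exists on the JSON side: org.apache.spark.sql.functions exposes an overload def from_json(e: Column, schema: Column): Column in recent Spark releases, although that schema column is expected to be a literal expression (for example lit(...) or schema_of_json(...)) rather than a true per-row value. A minimal Scala sketch of that overload, using made-up data and a local session rather than anything from the thread above:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

object FromJsonColumnSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from_json-column-schema").getOrCreate()
    import spark.implicits._

    // Made-up JSON payloads; only the shape of the from_json call matters here.
    val df = Seq("""{"str1":"apple1","str2":"apple2"}""").toDF("jsonData")

    // The schema argument is a Column, derived here from a sample document.
    val schemaCol = schema_of_json(lit("""{"str1":"a","str2":"b"}"""))
    val parsed = df.select(from_json($"jsonData", schemaCol).as("parsedData"))

    parsed.show(truncate = false)
    spark.stop()
  }
}
{code}

The proposed def from_avro(col: Column, jsonFormatSchema: Column): Column would mirror this signature on the Avro side.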
[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984428#comment-15984428 ] Philip Adetiloye commented on SPARK-20470: -- [~srowen] I'm sorry, I added an example. > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. > I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > {code} > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > {code} > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], > [ > 2.448955032046606, > 690.0, > 0.1564485056196041 > ] > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase {code} rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( {code} Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > {code} > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > {code} > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.13605802080739
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], >
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.156448505
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good # root # |-- feature: string (nullable = true) # |-- histogram: array (nullable = true) # ||-- element: struct (containsNull = true) # |||-- start: double (nullable = true) # |||-- width: double (nullable = true) # |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], >
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good # root # |-- feature: string (nullable = true) # |-- histogram: array (nullable = true) # ||-- element: struct (containsNull = true) # |||-- start: double (nullable = true) # |||-- width: double (nullable = true) # |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. > I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > # root > # |-- feature: string (nullable = true) > # |-- histogram: array (nullable = true) > # ||-- element: struct (containsNull = true) > # |||-- start: double (nullable = true) > # |||-- width: double (nullable = true) > # |||-- y: double (nullable = true) > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], > [ > 2.448955032046606, > 690.0, > 0.1564485056196041 > ] > } -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
Philip Adetiloye created SPARK-20470: Summary: Invalid json converting RDD row with Array of struct to json Key: SPARK-20470 URL: https://issues.apache.org/jira/browse/SPARK-20470 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.3 Reporter: Philip Adetiloye Trying to convert an RDD in pyspark containing an array of structs doesn't generate the right JSON. It looks trivial, but I can't get good JSON out. rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict()))) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
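The flattening reported in this thread happens because PySpark's Row is a tuple subclass and Row.asDict() is not recursive by default, so json.dumps renders the nested histogram structs as plain lists; current PySpark versions also accept Row.asDict(recursive=True) for this case. An alternative is to let Spark serialize each row itself. A rough Scala sketch, assuming a newer Spark API (SparkSession) and a hypothetical histogram.json file shaped like the document in the report:

{code:java}
import org.apache.spark.sql.SparkSession

object NestedRowsToJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-rows-to-json").getOrCreate()

    // multiLine is needed because the sample document is pretty-printed JSON.
    val df = spark.read.option("multiLine", "true").json("histogram.json")

    // Dataset.toJSON serializes every row, including the array of structs,
    // back to JSON with field names preserved, which is the output the
    // reporter expected from json.dumps(row.asDict()).
    df.toJSON.collect().foreach(println)

    spark.stop()
  }
}
{code}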
[jira] [Comment Edited] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15659686#comment-15659686 ] Philip Adetiloye edited comment on SPARK-18363 at 11/12/16 1:47 PM: duplicate graph vertex IDs cause this issue was (Author: pkadetiloye): duplicated graph vertex IDs cause this issue > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by GraphX connected components doesn't seem to work > correctly on graphs with many nodes. > It only works correctly on a small graph -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye closed SPARK-18363. Resolution: Resolved duplicated graph vertex IDs cause this issue > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by GraphX connected components doesn't seem to work > correctly on graphs with many nodes. > It only works correctly on a small graph -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652009#comment-15652009 ] Philip Adetiloye commented on SPARK-18363: -- I logged a similar issue with GraphFrames, but the problem also exists in GraphX. Basically I'm trying to cluster a hierarchical dataset. This works fine for a small dataset: I can cluster the data into separate clusters. However, for a large hierarchical dataset (about 1.6 million vertices) the result seems wrong. The resulting clusters from connected components have many intersections. This should not be the case. I expect the hierarchical dataset to be clustered into separate smaller clusters. ... val vertices = universe.map(u => (u.id, u.username, u.age, u.gamescore)) .toDF("id", "username", "age", "gamescore") .alias("v") val lookup = sparkSession.sparkContext.broadcast(universeMap.rdd.collectAsMap()) def buildEdges(src: String, dest: String) = { Edge(lookup.value.get(src).get, lookup.value.get(dest).get, 0) } val edges = similarityDatasetNoJboss.mapPartitions(_.map(s => buildEdges(s.username1, s.username2))) .toDF("src", "dst", "default") val graph = GraphFrame(vertices, edges) val cc = graph.connectedComponents.run().select("id", "component") Do some validation tests: Select id, count(component) group by id I expect each `id` to belong to one cluster/component with count = 1; instead each `id` belongs to multiple clusters/components. > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by GraphX connected components doesn't seem to work > correctly on graphs with many nodes. > It only works correctly on a small graph -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18363) Connected component for large graph result is wrong
Philip Adetiloye created SPARK-18363: Summary: Connected component for large graph result is wrong Key: SPARK-18363 URL: https://issues.apache.org/jira/browse/SPARK-18363 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.0.1 Reporter: Philip Adetiloye The clustering done by GraphX connected components doesn't seem to work correctly on graphs with many nodes. It only works correctly on a small graph -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
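The resolution recorded above points at duplicate vertex IDs as the cause of the overlapping components. A small GraphX sketch with toy data (not the reporter's dataset) showing vertex IDs being deduplicated before connectedComponents runs, so each id lands in exactly one component:

{code:java}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object DedupVerticesConnectedComponents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dedup-vertices-cc").getOrCreate()
    val sc = spark.sparkContext

    // Toy vertices: id 2L appears twice, mimicking the duplicate-ID situation
    // identified in the resolution of this issue.
    val rawVertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (2L, "bob_dup"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))

    // Keep a single attribute per vertex id before building the graph.
    val vertices = rawVertices.reduceByKey((first, _) => first)

    val graph = Graph(vertices, edges)
    val components = graph.connectedComponents().vertices

    components.collect().foreach { case (id, component) => println(s"$id -> $component") }
    spark.stop()
  }
}
{code}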
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:17 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil > Failed to connect to yarn / spark-submit --master yarn-client > - > > Key: SPARK-9485 > URL: https://issues.apache.org/jira/browse/SPARK-9485 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, YARN >Affects Versions: 1.4.1 > Environment: DEV >Reporter: Philip Adetiloye >Priority: Minor > > Spark-submit throws an exception when connecting to yarn but it works when > used in standalone mode. > I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but > got the same exception below. 
> spark-submit --master yarn-client > Here is a stack trace of the exception: > 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler > 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all > executors > Exception in thread "Yarn application state monitor" > org.apache.spark.SparkException: Error asking standalone schedule > r to shut down executors > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken > d.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 > 66) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: > 139) > Caused by: java.lang
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil > Failed to connect to yarn / spark-submit --master yarn-client > - > > Key: SPARK-9485 > URL: https://issues.apache.org/jira/browse/SPARK-9485 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, YARN >Affects Versions: 1.4.1 > Environment: DEV >Reporter: Philip Adetiloye >Priority: Minor > > Spark-submit throws an exception when connecting to yarn but it works when > used in standalone mode. > I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but > got the same exception below. 
> spark-submit --master yarn-client > Here is a stack trace of the exception: > 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler > 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all > executors > Exception in thread "Yarn application state monitor" > org.apache.spark.SparkException: Error asking standalone schedule > r to shut down executors > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken > d.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 > 66) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: > 139) > Caused by: java.lang.Inte
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop" SPARK_YARN_QUEUE="dev" and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native" export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil > Failed to connect to yarn / spark-submit --master yarn-client > - > > Key: SPARK-9485 > URL: https://issues.apache.org/jira/browse/SPARK-9485 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, YARN >Affects Versions: 1.4.1 > Environment: DEV >Reporter: Philip Adetiloye >Priority: Minor > > Spark-submit throws an exception when connecting to yarn but it works when > used in standalone mode. > I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but > got the same exception below. 
> spark-submit --master yarn-client > Here is a stack trace of the exception: > 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler > 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all > executors > Exception in thread "Yarn application state monitor" > org.apache.spark.SparkException: Error asking standalone schedule > r to shut down executors > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken > d.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 > 66) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: > 139) > Caused by: java.lang.Inter
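For reference, here is a minimal sketch of how HADOOP_CONF_DIR and YARN_CONF_DIR are conventionally set for yarn-client mode: both are expected to point at the directory that holds the client-side Hadoop/YARN XML files (core-site.xml, yarn-site.xml). The values below are illustrative only and are not taken from the cluster described above:
{code:bash}
#!/usr/bin/env bash
# conf/spark-env.sh -- illustrative sketch only; adjust the paths to the actual install
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop   # directory containing core-site.xml and hdfs-site.xml
export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop     # directory containing yarn-site.xml
export SPARK_YARN_QUEUE=dev
{code}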
[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233 ] Philip Adetiloye commented on SPARK-9485:
-
[~srowen] Thanks for the quick reply. It is actually consistent (it happens every time); here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves:

10.0.0.204
10.0.0.205

~/.profile contains my settings here:

export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"
export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

Hope this helps. Thanks, - Phil

> Failed to connect to yarn / spark-submit --master yarn-client > - > > Key: SPARK-9485 > URL: https://issues.apache.org/jira/browse/SPARK-9485 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, YARN >Affects Versions: 1.4.1 > Environment: DEV >Reporter: Philip Adetiloye >Priority: Minor > > Spark-submit throws an exception when connecting to yarn but it works when > used in standalone mode. > I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but > got the same exception below. 
> spark-submit --master yarn-client > Here is a stack trace of the exception: > 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler > 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all > executors > Exception in thread "Yarn application state monitor" > org.apache.spark.SparkException: Error asking standalone schedule > r to shut down executors > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken > d.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 > 66) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: > 139) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java > :1326) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal > a:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke
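A minimal way to exercise the same yarn-client path outside the REPL is sketched below; the commands are the standard YARN and Spark CLIs, and the examples jar path is a placeholder for whatever actually ships in the 1.4.1 distribution:
{code:bash}
# Confirm the client can reach the ResourceManager at all
yarn node -list

# Submit a trivial job in yarn-client mode (Spark 1.4.x syntax)
spark-submit --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar 10
{code}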
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485: Shepherd: MEN CHAMROEUN > Failed to connect to yarn / spark-submit --master yarn-client > - > > Key: SPARK-9485 > URL: https://issues.apache.org/jira/browse/SPARK-9485 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, YARN >Affects Versions: 1.4.1 > Environment: DEV >Reporter: Philip Adetiloye >Priority: Minor > > Spark-submit throws an exception when connecting to yarn but it works when > used in standalone mode. > I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but > got the same exception below. > spark-submit --master yarn-client > Here is a stack trace of the exception: > 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler > 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all > executors > Exception in thread "Yarn application state monitor" > org.apache.spark.SparkException: Error asking standalone schedule > r to shut down executors > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken > d.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 > 66) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: > 139) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java > :1326) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal > a:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > java.lang.NullPointerException > at org.apache.spark.sql.SQLContext.(SQLContext.scala:193) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485:
Description:
Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below.

spark-submit --master yarn-client

Here is a stack trace of the exception:

15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
 at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
 at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
 at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
 at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
 at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
 at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
 at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
 at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
 at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.NullPointerException
 at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
 at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
 at $iwC$$iwC.(:9)
 at $iwC.(:18)
 at (:20)
 at .(:24)
 at .()
 at .(:7)
 at .()
 at $print()
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
 at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
 at
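When the "Yarn application state monitor" thread tears the context down like this, the more informative error is usually on the YARN side rather than in the driver output. A sketch of pulling it with the standard YARN CLI follows; the application id is a placeholder, and fetching logs this way requires log aggregation to be enabled:
{code:bash}
# Find the id of the failing application
yarn application -list -appStates ALL

# Fetch its aggregated container logs (placeholder id)
yarn logs -applicationId application_1438180000000_0001
{code}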
[jira] [Created] (SPARK-9485) Failed to connect to yarn
Philip Adetiloye created SPARK-9485: --- Summary: Failed to connect to yarn Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone schedule r to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken d.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2 66) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala: 139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java :1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal a:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.(SQLContext.scala:193) at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.(:9) at $iwC.(:18) at (:20) at .(:24) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInt
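Since both the prebuilt spark-1.4.1-bin-hadoop2.6 package and a source build reportedly fail the same way, one quick sanity check is whether the client-side versions actually line up; these are standard CLIs, nothing specific to this issue:
{code:bash}
hadoop version          # Hadoop line the client jars come from (expected 2.6.x here)
yarn version            # YARN client version on the PATH
spark-submit --version  # Spark build and the Scala version it was compiled against
{code}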