[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-04-20 Thread Philip Adetiloye (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-43201:
-
Description: 
The Spark from_avro function does not allow the schema parameter to come from a 
DataFrame column; it takes only a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schemas, since only one schema string can be passed in externally.

 

Here is what I would expect, similar to the from_json function:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
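For reference, from_json already exposes a Column-typed schema overload in org.apache.spark.sql.functions (def from_json(e: Column, schema: Column): Column), which is the asymmetry being pointed out here. A minimal sketch of that existing usage, assuming a hypothetical DataFrame df with a JSON string column named jsonData; as far as I know the schema Column there still has to be foldable (e.g. built via schema_of_json over a literal), so the point is signature consistency rather than an out-of-the-box per-row schema:
{code:java}
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

// from_json's schema parameter can already be a Column:
//   def from_json(e: Column, schema: Column): Column
// Here the schema Column is derived from a sample document with schema_of_json.
val sampleJson = """{"str1":"apple1","str2":"apple2"}"""

val parsedJson = df.select(
  from_json($"jsonData", schema_of_json(lit(sampleJson))).as("parsedData"))
{code}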
 

  was:
Spark from_avro function does not allow schema parameter to use dataframe 
column but takes only a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schema since only one schema string could be pass externally. 

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 


> Inconsistency between from_avro and from_json function
> --
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Philip Adetiloye
>Priority: Major
>
> The Spark from_avro function does not allow the schema parameter to come from a 
> DataFrame column; it takes only a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different 
> schemas, since only one schema string can be passed in externally.
>  
> Here is what I would expect, similar to the from_json function:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>  
> val avroSchema2 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-04-20 Thread Philip Adetiloye (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-43201:
-
Description: 
The Spark from_avro function does not allow the schema parameter to come from a 
DataFrame column; it takes only a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schemas, since only one schema string can be passed in externally.

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 

  was:
Spark from_avro function does not allow schema to use dataframe column but 
takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schema since only one schema string could be pass externally. 

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 


> Inconsistency between from_avro and from_json function
> --
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Philip Adetiloye
>Priority: Major
>
> The Spark from_avro function does not allow the schema parameter to come from a 
> DataFrame column; it takes only a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different 
> schemas, since only one schema string can be passed in externally.
>  
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>  
> val avroSchema2 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-04-19 Thread Philip Adetiloye (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-43201:
-
Description: 
The Spark from_avro function does not allow the schema to come from a DataFrame 
column; it takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schemas, since only one schema string can be passed in externally.

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))

parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 

  was:
Spark from_avro function does not allow schema to use dataframe column but 
takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schema since only one schema string could be pass externally. 

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))

parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 


> Inconsistency between from_avro and from_json function
> --
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Philip Adetiloye
>Priority: Major
>
> The Spark from_avro function does not allow the schema to come from a DataFrame 
> column; it takes a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different 
> schemas, since only one schema string can be passed in externally.
>  
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>  
> val avroSchema2 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-04-19 Thread Philip Adetiloye (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-43201:
-
Description: 
The Spark from_avro function does not allow the schema to come from a DataFrame 
column; it takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schemas, since only one schema string can be passed in externally.

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 

  was:
Spark from_avro function does not allow schema to use dataframe column but 
takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schema since only one schema string could be pass externally. 

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))

parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 


> Inconsistency between from_avro and from_json function
> --
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Philip Adetiloye
>Priority: Major
>
> The Spark from_avro function does not allow the schema to come from a DataFrame 
> column; it takes a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different 
> schemas, since only one schema string can be passed in externally.
>  
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>  
> val avroSchema2 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-04-19 Thread Philip Adetiloye (Jira)
Philip Adetiloye created SPARK-43201:


 Summary: Inconsistency between from_avro and from_json function
 Key: SPARK-43201
 URL: https://issues.apache.org/jira/browse/SPARK-43201
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Philip Adetiloye


The Spark from_avro function does not allow the schema to come from a DataFrame 
column; it takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different 
schemas, since only one schema string can be passed in externally.

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val
 

val avroSchema2 = 
"""{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))

parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984428#comment-15984428
 ] 

Philip Adetiloye commented on SPARK-20470:
--

[~srowen] I'm sorry, I added an example. 

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}

Need to convert each row to json now and save to HBase 
{code}
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
{code}

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}
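As a side note on why the nested structs show up as plain arrays: a PySpark Row behaves like a tuple, so json.dumps renders nested Rows as JSON lists, and Row.asDict() is not recursive unless called as asDict(recursive=True). A minimal sketch of the Spark-native alternative, letting the engine serialize each row (shown with the Scala API purely for illustration; PySpark has the same toJSON method), assuming df is the DataFrame read from the JSON above:
{code:java}
// Let Spark produce the JSON instead of hand-converting rows; this keeps the
// array-of-struct shape of the histogram field.
val jsonStrings = df.toJSON   // Dataset[String], one JSON document per row

jsonStrings.take(1).foreach(println)
// {"feature":"feature_id_001","histogram":[{"start":1.9796...,"width":0.1564...,"y":968.0}, ...]}
{code}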


  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}



> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.13605802080739

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}


  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
>   

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.156448505

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

# root
#  |-- feature: string (nullable = true)
#  |-- histogram: array (nullable = true)
#  ||-- element: struct (containsNull = true)
#  |||-- start: double (nullable = true)
#  |||-- width: double (nullable = true)
#  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> 

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

# root
#  |-- feature: string (nullable = true)
#  |-- histogram: array (nullable = true)
#  ||-- element: struct (containsNull = true)
#  |||-- start: double (nullable = true)
#  |||-- width: double (nullable = true)
#  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.


rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))


> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
> # root
> #  |-- feature: string (nullable = true)
> #  |-- histogram: array (nullable = true)
> #  ||-- element: struct (containsNull = true)
> #  |||-- start: double (nullable = true)
> #  |||-- width: double (nullable = true)
> #  |||-- y: double (nullable = true)
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)
Philip Adetiloye created SPARK-20470:


 Summary: Invalid json converting RDD row with Array of struct to 
json
 Key: SPARK-20470
 URL: https://issues.apache.org/jira/browse/SPARK-20470
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.3
Reporter: Philip Adetiloye


Trying to convert an RDD in pyspark containing an array of structs doesn't generate 
the right JSON. It looks trivial, but I can't get good JSON out.


rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18363) Connected component for large graph result is wrong

2016-11-12 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15659686#comment-15659686
 ] 

Philip Adetiloye edited comment on SPARK-18363 at 11/12/16 1:47 PM:


duplicate graph vertex IDs cause this issue


was (Author: pkadetiloye):
duplicated graph vertex IDs cause this issue

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem to work 
> correctly on graphs with a large number of nodes.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18363) Connected component for large graph result is wrong

2016-11-12 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye closed SPARK-18363.

Resolution: Resolved

Duplicated graph vertex IDs caused this issue.

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem to work 
> correctly on graphs with a large number of nodes.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18363) Connected component for large graph result is wrong

2016-11-09 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652009#comment-15652009
 ] 

Philip Adetiloye commented on SPARK-18363:
--

I logged a similar issue against GraphFrames, but the problem also exists in 
GraphX.

Basically, I'm trying to cluster a hierarchical dataset. This works fine for a 
small dataset: I can cluster the data into separate clusters.

However, for a large hierarchical dataset (about 1.6 million vertices) the 
result seems wrong. The resulting clusters from connected components have many 
intersections. This should not be the case; I expect the hierarchical dataset 
to be clustered into separate, smaller clusters.

...
val vertices = universe.map(u => (u.id, u.username, u.age, u.gamescore))
  .toDF("id", "username", "age","gamescore")
  .alias("v")


val lookup = 
sparkSession.sparkContext.broadcast(universeMap.rdd.collectAsMap())


def buildEdges(src: String, dest: String) = {
Edge(lookup.value.get(src).get, lookup.value.get(dest).get, 0)
}


val edges  =  similarityDatasetNoJboss.mapPartitions(_.map(s => 
buildEdges(s.username1, s.username2)))
  .toDF("src", "dst", "default")

val graph = GraphFrame(vertices, edges)

val cc = graph.connectedComponents.run().select("id", "component")


Do some validation test (see the sketch below):

Select id, count(component)
group by id

I expect each `id` to belong to one cluster/component with count = 1; instead, an `id` 
belongs to multiple clusters/components.
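
A minimal sketch of that validation in DataFrame terms, assuming cc is the (id, component) DataFrame returned by connectedComponents above:
{code:java}
import org.apache.spark.sql.functions.countDistinct

// Every id should be assigned to exactly one component; any row surviving the
// filter is an id that ended up in more than one component.
val multiComponentIds = cc.groupBy("id")
  .agg(countDistinct("component").as("numComponents"))
  .filter("numComponents > 1")

multiComponentIds.show()   // expected to be empty
{code}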

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem to work 
> correctly on graphs with a large number of nodes.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18363) Connected component for large graph result is wrong

2016-11-08 Thread Philip Adetiloye (JIRA)
Philip Adetiloye created SPARK-18363:


 Summary: Connected component for large graph result is wrong
 Key: SPARK-18363
 URL: https://issues.apache.org/jira/browse/SPARK-18363
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.1
Reporter: Philip Adetiloye


The clustering done by the GraphX connected components algorithm doesn't seem to work 
correctly on graphs with a large number of nodes.

It only works correctly on a small graph.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:17 PM:
--

[~srowen] Thanks for the quick reply. It is actually consistent (every time), and 
here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
 Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It actually consistent (everytime) and 
here is the details of my configuration.

conf/spark-env.sh basically has this settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
> at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
> at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
> Caused by: java.lang

[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM:
--

[~srowen] Thanks for the quick reply. It is actually consistent (every time), and 
here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It actually consistent (everytime) and 
here is the details of my configuration.

conf/spark-env.sh basically has this settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:

`
export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

`
Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.Inte

[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM:
--

[~srowen] Thanks for the quick reply. It is actually consistent (it happens every time); here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves contains:
10.0.0.204
10.0.0.205

~/.profile contains these settings:

`
export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

`
Hope this helps.

Thanks,
- Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It is actually consistent (it happens every time); here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves contains:
10.0.0.204
10.0.0.205

~/.profile contains these settings:

export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.Inter

[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye commented on SPARK-9485:
-

[~srowen] Thanks for the quick reply. It is actually consistent (it happens every time); here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves contains:
10.0.0.204
10.0.0.205

~/.profile contains these settings:

export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke

[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-9485:

Shepherd: MEN CHAMROEUN

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 

[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-9485:

Description: 
Spark-submit throws an exception when connecting to YARN, but it works when used in standalone mode.

I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source, but I got the same exception below.

spark-submit --master yarn-client
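
For reference, a fuller form of this kind of invocation is sketched below; the example class, jar path, and resource flags are assumptions based on the spark-1.4.1-bin-hadoop2.6 layout, not the actual job that failed:

{code:bash}
# Sketch: a yarn-client submission of the bundled SparkPi example.
# The examples jar path/version is an assumption based on the 1.4.1 binary distribution.
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

/usr/local/spark/bin/spark-submit \
  --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 2 \
  --executor-memory 1g \
  /usr/local/spark/lib/spark-examples-1.4.1-hadoop2.6.0.jar 10
{code}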

Here is a stack trace of the exception:

15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 

[jira] [Created] (SPARK-9485) Failed to connect to yarn

2015-07-30 Thread Philip Adetiloye (JIRA)
Philip Adetiloye created SPARK-9485:
---

 Summary: Failed to connect to yarn
 Key: SPARK-9485
 URL: https://issues.apache.org/jira/browse/SPARK-9485
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit, YARN
Affects Versions: 1.4.1
 Environment: DEV
Reporter: Philip Adetiloye
Priority: Minor


Spark-submit throws an exception when connecting to YARN, but it works when used in standalone mode.

I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source, but I got the same exception below.
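
Before the trace below, a quick client-side sanity check (a sketch, assuming the stock Hadoop 2.6 CLI is on the PATH) is to confirm that the ResourceManager is reachable with the same configuration Spark will pick up:

{code:bash}
# Sketch: verify the YARN client configuration independently of Spark.
# Assumes HADOOP_CONF_DIR points at the directory containing yarn-site.xml.
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

yarn node -list          # lists NodeManagers if the ResourceManager is reachable
yarn application -list   # lists running YARN applications (may be empty)
{code}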

Here is a stack trace of the exception:

15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInt