[jira] [Updated] (SPARK-30006) printSchema indeterministic output

2019-11-23 Thread Hasil Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasil Sharma updated SPARK-30006:
-
Description: 
printSchema does not give consistent output in the following example.

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

df1 = spark.createDataFrame(people)

# printSchema() prints the schema itself and returns None, so no print() wrapper is needed.
df1.printSchema()

df2 = df1.select("name", "age")

df2.printSchema()
{code}
 

The first printSchema call outputs:
{noformat}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
{noformat}
 

The second printSchema call outputs:
{noformat}
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
{noformat}
Expectation: The output should be the same, because the column names are the same.
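
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in PySpark 2.4 a Row built from keyword arguments stores its fields sorted alphabetically by name, so df1 starts out as (age, name), while select("name", "age") keeps the order the caller gave. The sketch below models that in plain Python without Spark. The helpers make_row_fields, select, and same_columns are hypothetical stand-ins for illustration, not PySpark APIs.

```python
# Pure-Python model of the behaviour in the report; it does not import pyspark.
# Assumption: PySpark 2.4's Row(**kwargs) sorts its fields alphabetically by name.

def make_row_fields(**kwargs):
    """Hypothetical stand-in for Row(**kwargs): field names in sorted order."""
    return sorted(kwargs)


def select(fields, *names):
    """Hypothetical stand-in for DataFrame.select: keeps the caller's order."""
    missing = [n for n in names if n not in fields]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return list(names)


def same_columns(a, b):
    """Order-insensitive comparison of two column lists."""
    return sorted(a) == sorted(b)


df1_fields = make_row_fields(name="Ankit", age=25)
df2_fields = select(df1_fields, "name", "age")

print(df1_fields)                            # ['age', 'name']
print(df2_fields)                            # ['name', 'age']
print(df1_fields == df2_fields)              # False: the order differs
print(same_columns(df1_fields, df2_fields))  # True: same set of columns
```

Under this reading the two schemas differ only in column order, and the order of each one is fixed by how it was built, not chosen at random.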



> printSchema indeterministic output
> --
>
> Key: SPARK-30006
> URL: https://issues.apache.org/jira/browse/SPARK-30006
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Hasil Sharma
> Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30006) printSchema indeterministic output

2019-11-23 Thread Hasil Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasil Sharma updated SPARK-30006:
-
Description: 
printSchema does not give consistent output in the following example.

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

df1 = spark.createDataFrame(people_1)

# printSchema() prints the schema itself and returns None, so no print() wrapper is needed.
df1.printSchema()

df2 = df1.select("name", "age")

df2.printSchema()
{code}
 

The first printSchema call outputs:
{noformat}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
{noformat}
 

The second printSchema call outputs:
{noformat}
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
{noformat}
Expectation: The output should be the same, because the column names are the same.









[jira] [Updated] (SPARK-30006) printSchema indeterministic output

2019-11-23 Thread Hasil Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasil Sharma updated SPARK-30006:
-
Description: 
printSchema does not give consistent output in the following example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

df1 = spark.createDataFrame(people_1)

# printSchema() prints the schema itself and returns None, so no print() wrapper is needed.
df1.printSchema()

df2 = df1.select("name", "age")

df2.printSchema()
```

The first printSchema call outputs:

```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

The second printSchema call outputs:

```
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
```

Expectation: The output should be the same, because the column names are the same.









[jira] [Updated] (SPARK-30006) printSchema indeterministic output

2019-11-23 Thread Hasil Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasil Sharma updated SPARK-30006:
-
Description: 
printSchema does not give consistent output in the following example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

df1 = spark.createDataFrame(people_1)

# printSchema() prints the schema itself and returns None, so no print() wrapper is needed.
df1.printSchema()

df2 = df1.select("name", "age")

df2.printSchema()
```

The first printSchema call outputs:

```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

The second printSchema call outputs:

```
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
```

Expectation: The output should be the same, because the column names are the same.









[jira] [Created] (SPARK-30006) printSchema indeterministic output

2019-11-23 Thread Hasil Sharma (Jira)
Hasil Sharma created SPARK-30006:


 Summary: printSchema indeterministic output
 Key: SPARK-30006
 URL: https://issues.apache.org/jira/browse/SPARK-30006
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Hasil Sharma


printSchema does not give consistent output in the following example.

 

```python

from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

df1 = spark.createDataFrame(people_1)

# printSchema() prints the schema itself and returns None, so no print() wrapper is needed.
df1.printSchema()

df2 = df1.select("name", "age")

df2.printSchema()

```

 

The first printSchema call outputs:

```

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

```

 

The second printSchema call outputs:

```

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

```

Expectation: The output should be the same, because the column names are the same.






[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType

2016-08-07 Thread Hasil Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410946#comment-15410946
 ] 

Hasil Sharma commented on SPARK-12436:
--

Is this issue resolved? If not, I would like to contribute.

> If all values of a JSON field is null, JSON's inferSchema should return 
> NullType instead of StringType
> --
>
> Key: SPARK-12436
> URL: https://issues.apache.org/jira/browse/SPARK-12436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
> Labels: starter
>
> Right now, JSON's inferSchema returns {{StringType}} for a field that
> always has null values, and {{ArrayType(StringType)}} for a field that
> always has empty array values. Although this behavior makes writing JSON data
> to other data sources easy (i.e. when writing data, we do not need to remove
> {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for
> downstream applications to reason about the actual schema of the data and thus
> makes schema merging hard. We should allow JSON's inferSchema to return
> {{NullType}} and {{ArrayType(NullType)}}. We also need to make sure that,
> when we write data out, we remove those {{NullType}} or
> {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same
> thing for empty {{StructType}}s (i.e. a {{StructType}} with 0 fields).
> To finish this work, we need to complete the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether the operation of removing {{NullType}} and
> {{ArrayType(NullType)}} columns from the data to be written out is needed for
> all data sources (i.e. data sources based on our data source API and Hive
> tables), or only for certain data sources (e.g. Parquet). For example, we may
> not need this operation for Hive because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
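
The proposed inference rule and the write-time cleanup can be sketched in plain Python. This is only an illustration of the behavior the issue asks for, not Spark's implementation: the type names are plain strings that mirror Spark's naming, and infer_value, merge, infer_schema, and drop_null_columns are hypothetical helpers with deliberately simplified rules.

```python
import json

# Type names are plain strings chosen for this sketch; they mirror Spark's
# NullType / StringType / ArrayType naming but are not Spark objects.
NULL, STRING, LONG, DOUBLE, BOOLEAN = (
    "NullType", "StringType", "LongType", "DoubleType", "BooleanType",
)
ARRAY_PREFIX = "ArrayType("


def merge(a, b):
    """Merge two inferred types; NullType yields to any concrete type."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if a.startswith(ARRAY_PREFIX) and b.startswith(ARRAY_PREFIX):
        inner = merge(a[len(ARRAY_PREFIX):-1], b[len(ARRAY_PREFIX):-1])
        return f"{ARRAY_PREFIX}{inner})"
    return STRING  # conflicting types widen to string in this sketch


def infer_value(value):
    """Infer a type name for a single JSON value (simplified rules)."""
    if value is None:
        return NULL
    if isinstance(value, list):
        element = NULL  # an empty array stays ArrayType(NullType)
        for item in value:
            element = merge(element, infer_value(item))
        return f"{ARRAY_PREFIX}{element})"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return BOOLEAN
    if isinstance(value, int):
        return LONG
    if isinstance(value, float):
        return DOUBLE
    return STRING  # strings and nested objects fall back to string here


def infer_schema(lines):
    """Infer a {field: type} mapping from JSON-lines input."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema[field] = merge(schema.get(field, NULL), infer_value(value))
    return schema


def drop_null_columns(schema):
    """Remove NullType and ArrayType(NullType) columns before writing out."""
    unwanted = (NULL, f"{ARRAY_PREFIX}{NULL})")
    return {f: t for f, t in schema.items() if t not in unwanted}


lines = [
    '{"id": 1, "note": null, "tags": []}',
    '{"id": 2, "note": null, "tags": []}',
]
schema = infer_schema(lines)
print(schema)
print(drop_null_columns(schema))
```

Here note is always null and tags is always an empty array, so they are inferred as NullType and ArrayType(NullType) rather than as string types, and drop_null_columns leaves only id for the writer.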



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
