[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939572#comment-15939572
 ] 

Nathan Howell commented on SPARK-19641:
---

Please pick it up if you have cycles and want to take it over; otherwise I'll 
get to it later next week. Thanks!

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> -
>
> Key: SPARK-19641
> URL: https://issues.apache.org/jira/browse/SPARK-19641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
> columns. This occurs when one document contains a valid JSON value (such as a 
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached 
> when merging a {{StringType}} and a {{StructType}} the resulting type will be 
> a {{StringType}}, which is then discarded because a {{StructType}} is 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939547#comment-15939547
 ] 

Nathan Howell commented on SPARK-19641:
---

[~hyukjin.kwon], I'm super busy through next Tuesday. I can get it opened 
before then but probably won't have time to do any work on it until later in 
the week. Are you trying to get this in before the 2.2 branch?







[jira] [Created] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-02-16 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-19641:
-

 Summary: JSON schema inference in DROPMALFORMED mode produces 
incorrect schema
 Key: SPARK-19641
 URL: https://issues.apache.org/jira/browse/SPARK-19641
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Nathan Howell


In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
columns. This occurs when one document contains a valid JSON value (such as a 
string or number) and the other documents contain objects or arrays.

When the default case in {{JsonInferSchema.compatibleRootType}} is reached when 
merging a {{StringType}} and a {{StructType}} the resulting type will be a 
{{StringType}}, which is then discarded because a {{StructType}} is expected.






[jira] [Updated] (SPARK-18772) Parsing JSON with some NaN and Infinity values throws NumberFormatException

2016-12-07 Thread Nathan Howell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Howell updated SPARK-18772:
--
Affects Version/s: 2.0.2

> Parsing JSON with some NaN and Infinity values throws NumberFormatException
> ---
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> JacksonParser tests for infinite and NaN values in a way that is not 
> supported by the underlying float/double parser. For example, the input 
> string is always lowercased to check for {{-Infinity}} but the parser only 
> supports titlecased values. So a {{-infinitY}} will pass the test but fail 
> with a {{NumberFormatException}} when parsing. This exception is not caught 
> anywhere and the task ends up failing.
> A related issue is that the code checks for {{Inf}} but the parser only 
> supports the long form of {{Infinity}}.
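The mismatch can be modeled in plain Python (a sketch, not Spark code: `strict_parse` stands in for Java's `Double.parseDouble`, which only accepts the exact spellings {{Infinity}}, {{-Infinity}}, and {{NaN}}):

```python
import re

# stand-in for the strict Java parser: exact spellings only
STRICT = {"Infinity": float("inf"), "-Infinity": float("-inf")}

def strict_parse(s):
    if s in STRICT:
        return STRICT[s]
    # plain numerals only; "Inf" and odd casings are rejected too
    if re.fullmatch(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?", s):
        return float(s)
    raise ValueError("NumberFormatException: " + s)

def passes_precheck(s):
    # the buggy check: lowercases the input first, so any casing slips through
    return s.lower() in ("nan", "infinity", "-infinity")

precheck_ok = passes_precheck("-infinitY")  # True: lowercased comparison matches
try:
    strict_parse("-infinitY")               # ...but the parse then throws
    raised = False
except ValueError:
    raised = True
```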






[jira] [Created] (SPARK-18772) Parsing JSON with some NaN and Infinity values throws NumberFormatException

2016-12-07 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-18772:
-

 Summary: Parsing JSON with some NaN and Infinity values throws 
NumberFormatException
 Key: SPARK-18772
 URL: https://issues.apache.org/jira/browse/SPARK-18772
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Nathan Howell
Priority: Minor


JacksonParser tests for infinite and NaN values in a way that is not supported 
by the underlying float/double parser. For example, the input string is always 
lowercased to check for {{-Infinity}} but the parser only supports titlecased 
values. So a {{-infinitY}} will pass the test but fail with a 
{{NumberFormatException}} when parsing. This exception is not caught anywhere 
and the task ends up failing.
A related issue is that the code checks for {{Inf}} but the parser only 
supports the long form of {{Infinity}}.






[jira] [Created] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-11-30 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-18658:
-

 Summary: Writing to a text DataSource buffers one or more lines in 
memory
 Key: SPARK-18658
 URL: https://issues.apache.org/jira/browse/SPARK-18658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.2
Reporter: Nathan Howell
Priority: Minor


The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
memory prior to writing to disk. For large rows this is inefficient. It may 
make sense to skip the {{TextOutputFormat}} record writer and go directly to 
the underlying {{FSDataOutputStream}}, allowing the writers to append arbitrary 
byte arrays (fractions of a row) instead of a full row.
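The proposed change can be sketched with plain byte streams (illustrative only; Spark's writers would sit on an {{FSDataOutputStream}}, not Python file objects):

```python
import io

# Buffered approach: the whole serialized row is assembled in memory,
# then written as one line. Large rows mean large allocations.
def write_row_buffered(out, fields):
    line = ",".join(fields) + "\n"
    out.write(line.encode("utf-8"))

# Streaming approach: append arbitrary byte fragments (fractions of a row)
# directly to the output stream, never materializing the full row.
def write_row_streaming(out, fields):
    first = True
    for f in fields:
        if not first:
            out.write(b",")
        out.write(f.encode("utf-8"))
        first = False
    out.write(b"\n")

a, b = io.BytesIO(), io.BytesIO()
write_row_buffered(a, ["x", "y", "z"])
write_row_streaming(b, ["x", "y", "z"])
```

Both produce identical output; only the peak memory per row differs.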






[jira] [Created] (SPARK-18654) JacksonParser.makeRootConverter has effectively unreachable code

2016-11-30 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-18654:
-

 Summary: JacksonParser.makeRootConverter has effectively 
unreachable code
 Key: SPARK-18654
 URL: https://issues.apache.org/jira/browse/SPARK-18654
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.2
Reporter: Nathan Howell
Priority: Minor


{{JacksonParser.makeRootConverter}} currently takes a {{DataType}} but is only 
called with a {{StructType}}. Revising the method to only accept a 
{{StructType}} allows us to remove some pattern matches.
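The shape of the cleanup, sketched in Python (the real method is Scala; the names here are illustrative):

```python
class StructType:
    def __init__(self, fields):
        self.fields = fields

# Before: the factory accepts any data type, so it must pattern match and
# carry a branch that callers never actually hit.
def make_root_converter_any(data_type):
    if isinstance(data_type, StructType):
        return lambda row: dict(zip(data_type.fields, row))
    raise TypeError("effectively unreachable: callers always pass a struct")

# After: narrowing the parameter to StructType removes the dead branch
# and the pattern match entirely.
def make_root_converter(struct_type):
    return lambda row: dict(zip(struct_type.fields, row))

row = make_root_converter(StructType(["a", "b"]))([1, 2])
```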






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-29 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15705940#comment-15705940
 ] 

Nathan Howell commented on SPARK-18352:
---

I got hung up on some other work and haven't been able to get back to adding 
tests yet. WIP code is up here: 
https://github.com/NathanHowell/spark/commits/SPARK-18352

One question, though: https://github.com/apache/spark/pull/15813 touches a 
bunch of areas I was also working on. Do you think that patch will land soon? 
Should I rework mine on top of it?

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. 
> each record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and 
> are surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and rather stream through them to parse the JSON files.
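The difference between the two formats can be shown with plain Python, using the stdlib `json` module as a stand-in for Spark's Jackson-based parser:

```python
import json

# JSON Lines: one complete JSON value per physical line, so the file can be
# split on newlines and each record parsed independently.
json_lines = '{"a": 1}\n{"a": 2}\n'
per_line = [json.loads(line) for line in json_lines.splitlines()]

# A "normal" JSON file: a single value spanning multiple lines. Splitting it
# on newlines breaks the document, so the whole file must go through one
# parse -- which is what the proposed mode would do.
whole_file = '[\n  {"a": 1},\n  {"a": 2}\n]'
whole = json.loads(whole_file)
```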






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-17 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675966#comment-15675966
 ] 

Nathan Howell commented on SPARK-18352:
---

Sounds good to me. I have an implementation that's passing basic tests but 
needs to be cleaned up a bit. I'll get a pull request up in the next few days.







[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-17 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675421#comment-15675421
 ] 

Nathan Howell commented on SPARK-18352:
---

Do you have any ideas on how to support this? {{DataFrameReader.schema}} 
currently takes a {{StructType}}, and the existing row-level JSON reader 
flattens arrays out to satisfy this restriction.







[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-17 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675386#comment-15675386
 ] 

Nathan Howell commented on SPARK-18352:
---

Any opinions on configuring this with an option instead of creating a new 
data source? It looks fairly straightforward to support this as an option. E.g.:

{code}
// parse one json value per line
// this would be the default behavior, for backwards compatibility
spark.read.option("recordDelimiter", "line").json(???)

// parse one json value per file
spark.read.option("recordDelimiter", "file").json(???)
{code}

The refactoring work would be the same in either case, but it would require 
less plumbing for Python/Java/etc to enable this with an option.

As an aside, it is also straightforward to extend this to support {{Text}} 
and {{UTF8String}} values directly, avoiding a string conversion of the entire 
column prior to parsing.







[jira] [Created] (SPARK-10064) Decision tree continuous feature binning is slow in large feature spaces

2015-08-17 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-10064:
-

 Summary: Decision tree continuous feature binning is slow in large 
feature spaces
 Key: SPARK-10064
 URL: https://issues.apache.org/jira/browse/SPARK-10064
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Nathan Howell


When working with large feature spaces and high bin counts (500) the binning 
process can take many hours. This is particularly painful because it ties up 
executors for the duration, which is not shared-cluster friendly.

The binning process can and should be performed on the executors instead of the 
driver.
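The idea can be sketched with plain Python lists standing in for partitions (illustrative only, not MLlib code; the split-selection rule here is a naive equal-frequency one):

```python
# Instead of sorting the full feature column in one place (the driver),
# compute candidate split thresholds per partition where the data lives,
# and only merge the small per-partition summaries afterwards.

def partition_splits(values, num_bins):
    # runs on the "executor"; returns at most num_bins - 1 thresholds
    vs = sorted(values)
    step = max(1, len(vs) // num_bins)
    return vs[step::step][: num_bins - 1]

partitions = [[5, 1, 9, 3], [2, 8, 4, 7], [6, 0, 10, 11]]
num_bins = 4

# each "executor" summarizes its own partition...
per_partition = [partition_splits(p, num_bins) for p in partitions]
# ...and the driver only merges the summaries, never the raw data
merged = partition_splits(sorted(s for ps in per_partition for s in ps), num_bins)
```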






[jira] [Created] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema

2015-08-04 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-9618:


 Summary: SQLContext.read.schema().parquet() ignores the supplied 
schema
 Key: SPARK-9618
 URL: https://issues.apache.org/jira/browse/SPARK-9618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Nathan Howell
Priority: Minor


If a user supplies a schema when loading a Parquet file it is ignored and the 
schema is read off disk instead.






[jira] [Created] (SPARK-9617) Implement json_tuple

2015-08-04 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-9617:


 Summary: Implement json_tuple
 Key: SPARK-9617
 URL: https://issues.apache.org/jira/browse/SPARK-9617
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nathan Howell
Priority: Minor


Provide a native Spark implementation for {{json_tuple}}
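Hive's {{json_tuple}} extracts several top-level fields from a JSON string in one pass, returning NULL for missing keys. A minimal Python model of those semantics (a sketch of the behavior, not the Spark implementation):

```python
import json

def json_tuple(json_str, *keys):
    # one result per requested top-level key; None where the key is absent
    # or the input is not a JSON object, mirroring Hive's NULL behavior
    try:
        obj = json.loads(json_str)
    except ValueError:
        return tuple(None for _ in keys)
    if not isinstance(obj, dict):
        return tuple(None for _ in keys)
    return tuple(obj.get(k) for k in keys)

row = json_tuple('{"user": "n", "age": 3}', "user", "age", "missing")
```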






[jira] [Created] (SPARK-8278) Remove deprecated JsonRDD functionality

2015-06-09 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-8278:


 Summary: Remove deprecated JsonRDD functionality
 Key: SPARK-8278
 URL: https://issues.apache.org/jira/browse/SPARK-8278
 Project: Spark
  Issue Type: Story
Reporter: Nathan Howell
Priority: Minor


The old JSON functionality (deprecated in 1.4) needs to be removed for 1.5.






[jira] [Created] (SPARK-3858) SchemaRDD.generate ignores alias argument

2014-10-08 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-3858:


 Summary: SchemaRDD.generate ignores alias argument
 Key: SPARK-3858
 URL: https://issues.apache.org/jira/browse/SPARK-3858
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Nathan Howell
Priority: Minor


The {{alias}} argument to {{SchemaRDD.generate}} is discarded and a constant 
{{None}} is supplied to the {{logical.Generate}} constructor.






[jira] [Created] (SPARK-2876) RDD.partitionBy loads entire partition into memory

2014-08-06 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-2876:


 Summary: RDD.partitionBy loads entire partition into memory
 Key: SPARK-2876
 URL: https://issues.apache.org/jira/browse/SPARK-2876
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
Reporter: Nathan Howell


{{RDD.partitionBy}} fails due to an OOM in the PySpark daemon process when 
given a relatively large dataset. The use of 
{{BatchedSerializer(UNLIMITED_BATCH_SIZE)}} seems suspect; most other RDD 
methods use {{self._jrdd_deserializer}}.

{code}
y = x.keyBy(...)
z = y.partitionBy(512) # fails
z = y.repartition(512) # succeeds
{code}
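The serializer difference can be modeled in plain Python (illustrative, not PySpark internals): an unlimited batch size accumulates the entire partition before serializing, while a bounded batch keeps peak memory proportional to the batch size.

```python
def batched(items, batch_size):
    # bounded batching: at most batch_size items are held at once
    batch = []
    for it in items:
        batch.append(it)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

items = list(range(10))
# an "unlimited" batch size degenerates to one giant batch holding
# the whole partition, which is what triggers the OOM
unlimited = list(batched(items, len(items)))
# a bounded batch size caps how much is resident at a time
bounded = list(batched(items, 3))
```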


