[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939572#comment-15939572 ]

Nathan Howell commented on SPARK-19641:
---------------------------------------

Please pick it up if you have cycles and want to take it over, otherwise I'll get to it later next week. Thanks!

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19641
>                 URL: https://issues.apache.org/jira/browse/SPARK-19641
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no
> columns. This occurs when one document contains a valid JSON value (such as a
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached
> when merging a {{StringType}} and a {{StructType}} the resulting type will be
> a {{StringType}}, which is then discarded because a {{StructType}} is
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939547#comment-15939547 ]

Nathan Howell commented on SPARK-19641:
---------------------------------------

[~hyukjin.kwon], I'm super busy through next Tuesday. I can get it open before then but probably won't have time to do any work on it until later in the week. Are you trying to get this in before the 2.2 branch?
[jira] [Created] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
Nathan Howell created SPARK-19641:
-------------------------------------

             Summary: JSON schema inference in DROPMALFORMED mode produces incorrect schema
                 Key: SPARK-19641
                 URL: https://issues.apache.org/jira/browse/SPARK-19641
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Nathan Howell


In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no columns. This occurs when one document contains a valid JSON value (such as a string or number) and the other documents contain objects or arrays.

When the default case in {{JsonInferSchema.compatibleRootType}} is reached when merging a {{StringType}} and a {{StructType}} the resulting type will be a {{StringType}}, which is then discarded because a {{StructType}} is expected.
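The failure mode is easy to model in a few lines. The Java sketch below uses simplified stand-ins for Spark's Catalyst types and for {{JsonInferSchema.compatibleRootType}} (the names mirror Spark's, but none of this is the actual implementation):

```java
import java.util.*;

// A minimal model of the root-type merge in schema inference; these classes
// are illustrative stand-ins, not Spark's actual types.
public class SchemaMerge {
    interface DataType {}
    static class StringType implements DataType {
        public String toString() { return "StringType"; }
    }
    static class StructType implements DataType {
        final SortedMap<String, DataType> fields;
        StructType(SortedMap<String, DataType> fields) { this.fields = fields; }
        public String toString() { return "StructType" + fields.keySet(); }
    }

    static DataType compatibleRootType(DataType t1, DataType t2) {
        // Two structs merge field-wise...
        if (t1 instanceof StructType && t2 instanceof StructType) {
            SortedMap<String, DataType> merged = new TreeMap<>(((StructType) t1).fields);
            merged.putAll(((StructType) t2).fields);
            return new StructType(merged);
        }
        // ...but the default case collapses any mixed pair (here a struct
        // and a bare string) down to StringType.
        return new StringType();
    }

    static DataType struct(String field) {
        SortedMap<String, DataType> m = new TreeMap<>();
        m.put(field, new StringType());
        return new StructType(m);
    }

    public static void main(String[] args) {
        // One document is a bare JSON string, the others are objects.
        List<DataType> perDocument =
            Arrays.asList(struct("a"), new StringType(), struct("b"));

        DataType merged =
            perDocument.stream().reduce(SchemaMerge::compatibleRootType).get();
        // All field information is lost; because the reader requires a struct
        // at the root, DROPMALFORMED discards the result and the inferred
        // schema ends up with no columns.
        System.out.println(merged); // StringType
    }
}
```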
[jira] [Updated] (SPARK-18772) Parsing JSON with some NaN and Infinity values throws NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nathan Howell updated SPARK-18772:
----------------------------------
    Affects Version/s: 2.0.2

> Parsing JSON with some NaN and Infinity values throws NumberFormatException
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-18772
>                 URL: https://issues.apache.org/jira/browse/SPARK-18772
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Nathan Howell
>            Priority: Minor
>
> JacksonParser tests for infinite and NaN values in a way that is not
> supported by the underlying float/double parser. For example, the input
> string is always lowercased to check for {{-Infinity}} but the parser only
> supports titlecased values. So a {{-infinitY}} will pass the test but fail
> with a {{NumberFormatException}} when parsing. This exception is not caught
> anywhere and the task ends up failing.
> A related issue is that the code checks for {{Inf}} but the parser only
> supports the long form of {{Infinity}}.
[jira] [Created] (SPARK-18772) Parsing JSON with some NaN and Infinity values throws NumberFormatException
Nathan Howell created SPARK-18772:
-------------------------------------

             Summary: Parsing JSON with some NaN and Infinity values throws NumberFormatException
                 Key: SPARK-18772
                 URL: https://issues.apache.org/jira/browse/SPARK-18772
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Nathan Howell
            Priority: Minor


JacksonParser tests for infinite and NaN values in a way that is not supported by the underlying float/double parser. For example, the input string is always lowercased to check for {{-Infinity}} but the parser only supports titlecased values. So a {{-infinitY}} will pass the test but fail with a {{NumberFormatException}} when parsing. This exception is not caught anywhere and the task ends up failing.

A related issue is that the code checks for {{Inf}} but the parser only supports the long form of {{Infinity}}.
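The underlying JVM behavior is easy to reproduce: {{Double.parseDouble}} accepts only the exact titlecased {{Infinity}} token (with optional sign), so a value that passes a case-insensitive check can still throw. A small demonstration:

```java
// Demonstrates the parser behavior behind the bug: Double.parseDouble is
// case-sensitive for the "Infinity" token and rejects the short form "Inf".
public class InfinityParsing {
    static boolean parses(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("-Infinity")); // true: exact titlecased form
        System.out.println(parses("-infinitY")); // false: passes a lowercased
                                                 // check but fails to parse
        System.out.println(parses("-Inf"));      // false: short form rejected
    }
}
```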
[jira] [Created] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory
Nathan Howell created SPARK-18658:
-------------------------------------

             Summary: Writing to a text DataSource buffers one or more lines in memory
                 Key: SPARK-18658
                 URL: https://issues.apache.org/jira/browse/SPARK-18658
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Nathan Howell
            Priority: Minor


The JSON and CSV writing paths buffer entire lines (or multiple lines) in memory prior to writing to disk. For large rows this is inefficient. It may make sense to skip the {{TextOutputFormat}} record writer and go directly to the underlying {{FSDataOutputStream}}, allowing the writers to append arbitrary byte arrays (fractions of a row) instead of a full row.
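A sketch of the proposed approach, with a plain {{OutputStream}} standing in for the underlying {{FSDataOutputStream}}; this writer interface is hypothetical, not Spark's actual API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical fragment-oriented writer: rather than materializing a whole
// row as one String for TextOutputFormat's record writer, it appends
// byte-array fragments of the row as they are produced.
public class FragmentWriter {
    private final OutputStream out;

    FragmentWriter(OutputStream out) { this.out = out; }

    // Append one fragment of a row directly to the stream; nothing larger
    // than a single fragment is ever buffered by the writer itself.
    void writeFragment(byte[] bytes) {
        try { out.write(bytes); } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Terminate the current row with a newline, as the text formats require.
    void endRow() {
        try { out.write('\n'); } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        FragmentWriter w = new FragmentWriter(sink);
        // A wide row is emitted field by field, so peak memory is bounded by
        // the largest fragment rather than the full row.
        for (String fragment : new String[] {"a", ",", "b", ",", "c"}) {
            w.writeFragment(fragment.getBytes(StandardCharsets.UTF_8));
        }
        w.endRow();
        System.out.print(new String(sink.toByteArray(), StandardCharsets.UTF_8)); // a,b,c
    }
}
```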
[jira] [Created] (SPARK-18654) JacksonParser.makeRootConverter has effectively unreachable code
Nathan Howell created SPARK-18654:
-------------------------------------

             Summary: JacksonParser.makeRootConverter has effectively unreachable code
                 Key: SPARK-18654
                 URL: https://issues.apache.org/jira/browse/SPARK-18654
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Nathan Howell
            Priority: Minor


{{JacksonParser.makeRootConverter}} currently takes a {{DataType}} but is only called with a {{StructType}}. Revising the method to only accept a {{StructType}} allows us to remove some pattern matches.
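A before/after sketch of this kind of narrowing (simplified types in Java, not the actual JacksonParser code): accepting the wide supertype forces a runtime match whose fallback branch is dead, while accepting the precise type removes it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RootConverter {
    interface DataType {}
    static class StringType implements DataType {}
    static class StructType implements DataType {
        final List<String> fieldNames;
        StructType(List<String> fieldNames) { this.fieldNames = fieldNames; }
    }

    // Before: accepts any DataType, so a runtime match is needed even though
    // only StructType is ever passed in; the fallback is effectively dead code.
    static List<String> makeRootConverterBefore(DataType dt) {
        if (dt instanceof StructType) {
            return ((StructType) dt).fieldNames;
        }
        return Collections.emptyList(); // effectively unreachable
    }

    // After: the signature encodes the invariant and the match disappears.
    static List<String> makeRootConverterAfter(StructType st) {
        return st.fieldNames;
    }

    public static void main(String[] args) {
        StructType schema = new StructType(Arrays.asList("a", "b"));
        System.out.println(makeRootConverterAfter(schema)); // [a, b]
    }
}
```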
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705940#comment-15705940 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Got hung up on some other stuff, haven't been able to get back to adding tests yet. WIP code is up here: https://github.com/NathanHowell/spark/commits/SPARK-18352

Question though. https://github.com/apache/spark/pull/15813 touches a bunch of areas I was also working on. Do you think this patch will land soon? Should I rework mine on top?

> Parse normal, multi-line JSON files (not just JSON Lines)
> ---------------------------------------------------------
>
>                 Key: SPARK-18352
>                 URL: https://issues.apache.org/jira/browse/SPARK-18352
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: releasenotes
>
> Spark currently can only parse JSON files that are JSON lines, i.e. each
> record has an entire line and records are separated by new line. In reality,
> a lot of users want to use Spark to parse actual JSON files, and are
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the
> files, and rather stream through them to parse the JSON files.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675966#comment-15675966 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Sounds good to me. I have an implementation that's passing basic tests but needs to be cleaned up a bit. I'll get a pull request up in the next few days.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675421#comment-15675421 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Do you have any ideas how to support this? {{DataFrameReader.schema}} currently takes a {{StructType}}, and the existing row-level JSON reader flattens arrays out to support this restriction.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675386#comment-15675386 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Any opinions on configuring this with an option instead of creating a new data source? It looks fairly straightforward to support this as an option. E.g.:

{code}
// parse one json value per line
// this would be the default behavior, for backwards compatibility
spark.read.option("recordDelimiter", "line").json(???)

// parse one json value per file
spark.read.option("recordDelimiter", "file").json(???)
{code}

The refactoring work would be the same in either case, but it would require less plumbing for Python/Java/etc to enable this with an option.

As an aside... it also is straightforward to extend this to support {{Text}} and {{UTF8String}} values directly, avoiding a string conversion of the entire column prior to parsing.
[jira] [Created] (SPARK-10064) Decision tree continuous feature binning is slow in large feature spaces
Nathan Howell created SPARK-10064:
-------------------------------------

             Summary: Decision tree continuous feature binning is slow in large feature spaces
                 Key: SPARK-10064
                 URL: https://issues.apache.org/jira/browse/SPARK-10064
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.4.1
            Reporter: Nathan Howell


When working with large feature spaces and high bin counts (500) the binning process can take many hours. This is particularly painful because it ties up executors for the duration, which is not shared-cluster friendly.

The binning process can and should be performed on the executors instead of the driver.
[jira] [Created] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema
Nathan Howell created SPARK-9618:
------------------------------------

             Summary: SQLContext.read.schema().parquet() ignores the supplied schema
                 Key: SPARK-9618
                 URL: https://issues.apache.org/jira/browse/SPARK-9618
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: Nathan Howell
            Priority: Minor


If a user supplies a schema when loading a Parquet file it is ignored and the schema is read off disk instead.
[jira] [Created] (SPARK-9617) Implement json_tuple
Nathan Howell created SPARK-9617:
------------------------------------

             Summary: Implement json_tuple
                 Key: SPARK-9617
                 URL: https://issues.apache.org/jira/browse/SPARK-9617
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Nathan Howell
            Priority: Minor


Provide a native Spark implementation for {{json_tuple}}.
[jira] [Created] (SPARK-8278) Remove deprecated JsonRDD functionality
Nathan Howell created SPARK-8278:
------------------------------------

             Summary: Remove deprecated JsonRDD functionality
                 Key: SPARK-8278
                 URL: https://issues.apache.org/jira/browse/SPARK-8278
             Project: Spark
          Issue Type: Story
            Reporter: Nathan Howell
            Priority: Minor


The old JSON functionality (deprecated in 1.4) needs to be removed for 1.5.
[jira] [Created] (SPARK-3858) SchemaRDD.generate ignores alias argument
Nathan Howell created SPARK-3858:
------------------------------------

             Summary: SchemaRDD.generate ignores alias argument
                 Key: SPARK-3858
                 URL: https://issues.apache.org/jira/browse/SPARK-3858
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Nathan Howell
            Priority: Minor


The {{alias}} argument to {{SchemaRDD.generate}} is discarded and a constant {{None}} is supplied to the {{logical.Generate}} constructor.
[jira] [Created] (SPARK-2876) RDD.partitionBy loads entire partition into memory
Nathan Howell created SPARK-2876:
------------------------------------

             Summary: RDD.partitionBy loads entire partition into memory
                 Key: SPARK-2876
                 URL: https://issues.apache.org/jira/browse/SPARK-2876
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.0.1
            Reporter: Nathan Howell


{{RDD.partitionBy}} fails due to an OOM in the PySpark daemon process when given a relatively large dataset. It seems that the use of {{BatchedSerializer(UNLIMITED_BATCH_SIZE)}} is suspect, most other RDD methods use {{self._jrdd_deserializer}}.

{code}
y = x.keyBy(...)
z = y.partitionBy(512) # fails
z = y.repartition(512) # succeeds
{code}