[ https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-14274.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 12088
[https://github.com/apache/spark/pull/12088]

> Add FileFormat.prepareRead to collect necessary global information
> ------------------------------------------------------------------
>
>                 Key: SPARK-14274
>                 URL: https://issues.apache.org/jira/browse/SPARK-14274
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 2.0.0
>
>
> One problem with our newly introduced {{FileFormat.buildReader()}} method is
> that it only sees pieces of input files. Data sources like CSV and LibSVM,
> on the other hand, require some global information:
> - CSV: the content of the header line if the {{header}} option is set to
>   true, so that we can filter out header lines within each input file. This
>   counts as global information because the header may appear in the middle
>   of a file, after blocks of comments and empty lines, although this is a
>   rare, contrived corner case.
> - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset
>   to infer the total number of features in order to construct the resulting
>   {{LabeledPoint}} instances.
> Unfortunately, our current API provides no way to gather this kind of
> global information.
> The solution proposed here is to add a {{prepareRead}} method, which accepts
> the same arguments as {{inferSchema}} but returns a {{ReadContext}}. The
> {{ReadContext}} contains an {{Option\[StructType\]}} for the inferred schema
> and a {{Map\[String, Any\]}} for any gathered global information, and it is
> then passed to {{buildReader()}}. By default, {{prepareRead}} simply calls
> {{inferSchema}} (the inferred schema itself can be considered a sort of
> global information).
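For readers skimming the archive, the proposal quoted above can be sketched roughly as follows. This is a toy illustration only: it follows the issue description, not necessarily the code merged in pull request 12088, and it does not depend on Spark. Everything here is a hypothetical stand-in: the simplified `FileFormat` trait, `ToyCsvFormat`, the `"header"` key in the map, the use of `Seq[String]` in place of `StructType`, in-memory line sequences in place of files, and the `#` comment convention.

```scala
// Hypothetical, simplified stand-ins; NOT Spark's actual classes or signatures.
final case class ReadContext(
    schema: Option[Seq[String]],   // stands in for Option[StructType]
    globalInfo: Map[String, Any])  // any gathered global information

trait FileFormat {
  // Each "file" is modeled as an in-memory sequence of lines.
  def inferSchema(
      options: Map[String, String],
      files: Seq[Seq[String]]): Option[Seq[String]]

  // Default behavior described in the issue: only infer the schema.
  def prepareRead(
      options: Map[String, String],
      files: Seq[Seq[String]]): ReadContext =
    ReadContext(inferSchema(options, files), Map.empty)

  // Builds a reader that only ever sees one piece of the input.
  def buildReader(context: ReadContext): Iterator[String] => Iterator[String]
}

object ToyCsvFormat extends FileFormat {
  private def headerEnabled(options: Map[String, String]): Boolean =
    options.get("header").contains("true")

  // Assumption: blank lines and lines starting with '#' carry no data.
  private def isData(line: String): Boolean =
    line.trim.nonEmpty && !line.startsWith("#")

  private def firstDataLine(files: Seq[Seq[String]]): Option[String] =
    files.iterator.flatten.find(isData)

  override def inferSchema(
      options: Map[String, String],
      files: Seq[Seq[String]]): Option[Seq[String]] =
    firstDataLine(files).map { line =>
      val fields = line.split(",", -1).toSeq
      // Use header names when available, otherwise generated column names.
      if (headerEnabled(options)) fields else fields.indices.map(i => s"C$i")
    }

  // Gather the header once, globally, before any per-piece reader is built.
  override def prepareRead(
      options: Map[String, String],
      files: Seq[Seq[String]]): ReadContext = {
    val header = if (headerEnabled(options)) firstDataLine(files) else None
    val info = header.fold(Map.empty[String, Any])(h => Map("header" -> h))
    ReadContext(inferSchema(options, files), info)
  }

  // The reader drops non-data lines and any line equal to the global header.
  override def buildReader(
      context: ReadContext): Iterator[String] => Iterator[String] = {
    val header = context.globalInfo.get("header")
    lines => lines.filter(line => isData(line) && !header.contains(line))
  }
}
```

In this sketch, calling `ToyCsvFormat.prepareRead(Map("header" -> "true"), files)` once and passing the result to `buildReader` lets each per-piece reader drop header lines it could not have identified on its own, which is the gap the issue describes.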