[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182586#comment-17182586 ]

chanduhawk edited comment on SPARK-32614 at 8/23/20, 5:20 AM:
--
[~srowen] The Univocity parser's default comment character is #; it seems Spark overrides that setting to the \u0000 character. Please see https://github.com/uniVocity/univocity-parsers/pull/412, which I raised and which adds a new option to enable or disable comment processing. Even with that PR, I still think there should be an option in Spark CSV to enable or disable comment processing, so that the boolean value can be passed through to the Univocity parser. If we changed the default behaviour instead, it might break existing users for whom \u0000 is the comment character by default, so a separate optional config is the better solution. What I am saying is that we need to wait for Univocity 3.0.0, where the new change will be available; then we can add the Spark changes in a proper manner.
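The behaviour under discussion can be sketched in a few lines of Python. This is illustrative only: the function name, the boolean toggle, and the naive `split(",")` parsing are assumptions for the sketch, not Spark's or Univocity's real API.

```python
# Hypothetical sketch (not Spark's or univocity's actual code) of how a
# comment-character check discards records during CSV parsing.
def parse_lines(lines, comment_char="\u0000", process_comments=True):
    """Yield parsed rows, skipping lines that begin with comment_char
    whenever comment processing is enabled."""
    for line in lines:
        if process_comments and line.startswith(comment_char):
            continue  # treated as a comment and silently dropped
        yield line.split(",")

data = ["\u0000null,abc,test", "1,foo,bar"]
# Default behaviour: the row starting with \u0000 is lost.
assert list(parse_lines(data)) == [["1", "foo", "bar"]]
# With comment processing disabled, both lines survive as records.
assert len(list(parse_lines(data, process_comments=False))) == 2
```

The proposed boolean option corresponds to the `process_comments` flag here: when it is off, no line is ever interpreted as a comment.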
> Support for treating the line as a valid record if it starts with \u0000 or
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.2.3, 2.4.5, 3.0.0
> Reporter: chanduhawk
> Priority: Minor
> Labels: correctness
> Attachments: screenshot-1.png
>
> In most data-warehousing scenarios, files do not contain comment records, and every line needs to be treated as a valid record even if it starts with the default comment character \u0000 (the null character). Though the user can set a comment character other than \u0000, there is a chance that actual records start with that character.
> Currently, for the piece of code below and test data whose first row starts with the null (\u0000) character, Spark throws the error below.
> *eg:* val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false);
> *+TestData+*
> !screenshot-1.png!
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
> at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
> at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the Univocity parser, and the workaround is to provide another comment character via .option("comment","#"), if the actual data starts with that character then that row will be discarded.
> I have already pushed code to the Univocity parser to handle this scenario as part of the PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept the JIRA so that we can enable this feature in spark-csv by adding a parameter in Spark CSVOptions.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179150#comment-17179150 ]

chanduhawk edited comment on SPARK-32614 at 8/22/20, 4:52 PM:
--
[~srowen] In Spark, if the comment option is not set, \u0000 is taken as the default comment character. In our case a few rows of the data file start with the \u0000 character, so Spark ignores those rows. Currently there is no option among the Spark CSV options to disable comment processing entirely so that all rows are treated as valid. You can recreate this issue using the data file shown in the screenshot attached to the JIRA: process the CSV file with any row that starts with \u0000.
[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179097#comment-17179097 ]

chanduhawk edited comment on SPARK-32614 at 8/17/20, 4:25 PM:
--
[~srowen] Yes, a row that starts with the null character is not a comment in actual data-warehouse projects; we have to process all rows irrespective of their starting characters. I already raised a PR against the Univocity parser, now merged, that adds an option controlling whether comment characters are processed: https://github.com/uniVocity/univocity-parsers/pull/412

In Spark we need to add the corresponding option in the CSVOptions and CSVUtils classes.

*Proposed changes in Spark:*
val df = spark.read.option("delimiter",",").option("processComments","false").csv("file:/E:/Data/Testdata.dat");

When this *processComments* option is set to false, Spark should not check for any comment characters and will process all rows, even those starting with null or any other comment character. Please let me know if I can raise a PR once this enhancement is accepted.
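To illustrate the proposal, here is a minimal Python sketch of how a boolean "processComments" reader option could be read out of the option map. The `CsvOptions` class and its fields are purely hypothetical stand-ins, not Spark's actual CSVOptions implementation.

```python
# Hypothetical sketch of parsing a boolean "processComments" reader
# option; CsvOptions here is illustrative, not Spark's real class.
class CsvOptions:
    def __init__(self, params):
        self.delimiter = params.get("delimiter", ",")
        # Spark reader options arrive as strings, so "false"/"true"
        # must be converted to a boolean before reaching the parser.
        raw = params.get("processComments", "true")
        self.process_comments = raw.strip().lower() == "true"

assert CsvOptions({"delimiter": ",", "processComments": "false"}).process_comments is False
# When the option is absent, comment processing stays enabled,
# preserving today's behaviour for existing users.
assert CsvOptions({}).process_comments is True
```

Defaulting the option to true matches the backward-compatibility argument above: existing users who rely on \u0000 acting as a comment character see no change.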
[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560 ]

chanduhawk edited comment on SPARK-32614 at 8/17/20, 3:49 AM:
--
[~srowen] *Currently Spark cannot process a row that starts with the null character.* If one of the rows of the data file (CSV) starts with the null (\u0000) character (see the attached screenshot), e.g.

*null*,abc,test

then Spark throws the error mentioned in the description; i.e., Spark cannot process any row that starts with the null character. It can only process such a row if we set an option like:

option("comment","a character")

*comment* takes a character to be treated as the comment character, so Spark won't process rows that start with it. This is a workaround for processing rows that start with null, but it is also flawed: it can skip a valid row of data that happens to start with the chosen comment character. In data warehousing there is usually no concept of comment characters, and all rows need to be processed. So there should be an option in Spark that disables comment processing altogether, e.g.

option("enableProcessingComments", false)

This option would disable any comment-character checks, so rows starting with null or \u0000 can also be processed.
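The flaw in the workaround described above can be demonstrated with a tiny Python sketch (illustrative only; the `read_csv` helper and its `split(",")` parsing are assumptions for the example, not Spark's behaviour in detail):

```python
# Illustrative sketch of why the .option("comment", "#") workaround is
# lossy: any genuine record beginning with the chosen character is
# skipped right along with the real comments.
def read_csv(lines, comment="#"):
    return [line.split(",") for line in lines if not line.startswith(comment)]

rows = ["# a real comment", "#123,widget,9.99", "456,gadget,1.50"]
# The valid record "#123,widget,9.99" is silently discarded too.
assert read_csv(rows) == [["456", "gadget", "1.50"]]
```

No choice of comment character is safe if the data can legitimately begin with that character, which is why disabling comment processing entirely is the only lossless option.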
[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560 ] chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:39 PM: -- *currently spark cannt process the row that starts with null character.* If one of the rows the data file(CSV) starts with null or \u character like below(PFA screenshot) like below *null*,abc,test then spark will throw the error as mentioned in the description. i.e spark cannt process any row that starts with null character. It can only process the row if we set the options like below option("comment","a character") *comment* - it will take a character that needs to be treated as comment character so spark wont process the row that starts with this character The above is a work around to process the row that starts with null. But this process is also flawed and subject to skip a valid row of data that may start with the comment character. In data ware house most of the time we dont have comment charaters concept and all the rows needs to be processed. So there should be an option in spark which will disable the processing of comment characters like below option("enableProcessingComments", false) this option will disable checking for any comment character processing and can also process the rows that starts with the null or \u character was (Author: chanduhawk): *currently spark cannt process the row that starts with null character.* If one of the rows the data file(CSV) starts with null or \u character like below(PFA screenshot) like below *null*,abc,test then spark will throw the error as mentioned in the description. i.e spark cannt process any row that starts with null character. 
It can only process the row if we set the options like below option("comment","a character") comment - it will take a character that needs to be treated as comment character so spark wont process the row that starts with this character The above is a work around to process the row that starts with null. But this process is also flawed and subject to skip a valid row of data that may start with the comment character. In data ware house most of the time we dont have comment charatcers concept and all the rows needs to be processed. So there should be an option in spark which will disable the processing of comment characters like below option("enableProcessingComments", false) this option will disable checking for any comment character processing. > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: chanduhawk >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the below error. > *eg: *val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) > *Note:* > Though its the limitation of the univocity parser and the workaround is to > provide any other comment character by mentioning .option("comment","#"), but > if my actual data starts with this character then the particular row will be > discarded. > Currently I pushed the code in univocity parser to
[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560 ] chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:37 PM:
--
*Currently Spark cannot process a row that starts with the null character.*

If a row in a CSV data file starts with the null (\u0000) character, like below (see the attached screenshot):

*null*,abc,test

then Spark throws the error mentioned in the description; that is, Spark cannot process any row that starts with the null character. It can only process such a row if we set an option like:

option("comment", "<a character>")

The comment option takes a character to be treated as the comment character, so Spark will not process any row that starts with that character. This is a workaround for processing rows that start with null, but it is flawed: it can silently skip a valid data row that happens to start with the chosen comment character. In data warehousing there is usually no concept of comment characters, and every row needs to be processed. So there should be an option in Spark that disables comment processing altogether, for example:

option("enableProcessingComments", false)

This option would disable checking for any comment character.

> Support for treating the line as valid record if it starts with \u0000 or
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.4.5, 3.0.0
> Reporter: chanduhawk
> Assignee: Jeff Evans
> Priority: Major
> Attachments: screenshot-1.png
>
> In most data warehousing scenarios, files do not have comment records, and
> every line needs to be treated as a valid record even if it starts with the
> default comment character \u0000 (the null character). Though the user can
> set a comment character other than \u0000, there is a chance that an actual
> record starts with that character too.
> Currently, for the code below and test data whose first row starts with the
> null (\u0000) character, Spark throws the error below.
> *eg:* val df = spark.read.option("delimiter", ",").csv("file:/E:/Data/Testdata.dat");
> df.show(false);
> *+TestData+*
>
> !screenshot-1.png!
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
> at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
> at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the univocity parser, and the workaround is to
> provide another comment character via .option("comment", "#"), if the actual
> data starts with that character then that particular row will be discarded.
> I pushed code to the univocity parser to handle this scenario as part of the
> PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
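The trade-off described above (comment processing silently dropping a valid row that starts with the comment character, versus a switch that disables comment handling entirely) can be sketched in a few lines of Python. This is a simplified stand-in for the parser, not Spark or univocity code: `parse_csv_lines` and its `process_comments` parameter are hypothetical, and the sketch models only the row-skipping behavior, not the separate schema-inference crash shown in the stack trace.

```python
def parse_csv_lines(lines, comment="\u0000", process_comments=True):
    """Split comma-delimited lines into records, optionally skipping any
    line that starts with the comment character (a simplified model of
    comment handling in a CSV parser)."""
    records = []
    for line in lines:
        if process_comments and line.startswith(comment):
            continue  # line is treated as a comment and silently dropped
        records.append(line.split(","))
    return records

lines = ["\u0000null,abc,test", "1,foo,bar"]

# With comment processing on and \u0000 as the comment character, the
# first row is dropped even though it is valid data.
print(parse_csv_lines(lines))  # [['1', 'foo', 'bar']]

# With comment processing disabled (the behavior a hypothetical
# enableProcessingComments=false option would give), every row survives.
print(parse_csv_lines(lines, process_comments=False))
```

The second call illustrates why a boolean switch is safer than merely remapping the comment character: no matter which character is chosen, some legitimate record could begin with it, whereas disabling the check guarantees every line is parsed.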