Luoc,

Thanks for your reply. Can you point me to documentation about how to switch readers?
On Fri, May 21, 2021 at 7:08 AM luoc <l...@apache.org> wrote:

> Hi Ted,
> Since 1.16 you can use the new version of the CSV reader (backed by
> CompliantTextBatchReader) to query CSV files, with no change in usage.
> But this reader does not support your idea. I think we can add some code
> to enhance the reader. All the new storage and format plugins are based
> on EVF, which is more powerful and stable.
>
> > On May 20, 2021, at 10:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > Luoc,
> >
> > How do I use the CompliantTextBatchReader?
> >
> > How is the speed?
> >
> > Can you point me at the old CSV reader? I am not sure where it is.
> >
> > On Thu, May 20, 2021 at 1:09 AM luoc <l...@apache.org> wrote:
> >
> >> Hello Ted,
> >> It's a nice idea. I did a quick review of the CSV reader but did not
> >> find any settings for handling errors. Also, we have refactored the
> >> CSV format using EVF; please see CompliantTextBatchReader.java
> >> (complies with the RFC 4180 standard for text/csv files).
> >>
> >>> On May 20, 2021, at 13:49, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>
> >>> I have a CSV file that causes an exception when read by Drill. The
> >>> file is slightly malformed (but R can read it).
> >>>
> >>> Interestingly, if I don't parse the header line, I don't get the
> >>> exception and the problematic embedded quotes are handled well.
> >>> Likewise, deleting the first data line (which is well-formed) causes
> >>> the exception to go away. Deleting the second data line also causes
> >>> the exception to stop. Fixing the quoting of the included quotes also
> >>> fixes the problem. Swapping the lines works like deleting the first
> >>> line. Repeating the first line after the second line still gets the
> >>> exception.
> >>>
> >>> The file is this:
> >>> -------------------------
> >>> desc,name
> >>> "foo","x"
> >>> "manure called "foo"","y"
> >>> -------------
> >>>
> >>> The exception is shown below. My thought is that if the CSV file is
> >>> considered malformed, we should get an error on the line that says
> >>> something along the lines of "mal-formed input". Even better would be
> >>> to allow such lines to be omitted (up to some sanity limit) or to
> >>> parse it correctly (which happens without headers being parsed).
> >>>
> >>> Anybody have any thoughts?
> >>>
> >>> Here is the R behavior (it omits the embedded quotes):
> >>>
> >>> > f = read.csv("v.csv")
> >>> > f
> >>>                desc name
> >>> 1               foo    x
> >>> 2 manure called foo    y
> >>>
> >>> And here is the exception:
> >>>
> >>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: NegativeArraySizeException
> >>> Please, refer to logs for more information.
> >>> [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
> >>> (java.lang.NegativeArraySizeException) null
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
> >>> org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
> >>> org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
> >>> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
> >>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> >>> java.security.AccessController.doPrivileged():-2
> >>> javax.security.auth.Subject.doAs():422
> >>> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> >>> org.apache.drill.common.SelfCleaningRunnable.run():38
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> >>> java.lang.Thread.run():748
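
For reference, the behavior proposed in the original message (report "mal-formed input" with a line number, or skip such lines up to a sanity limit) can be sketched outside Drill. The following is only an illustration using Python's standard csv module, not Drill's reader or any Drill API. The names read_csv_lenient, max_bad_lines and MalformedInput are hypothetical, and the sketch assumes one record per line, so quoted fields that span lines are not handled.

```python
import csv

# The three-line file from the original report.
SAMPLE = 'desc,name\n"foo","x"\n"manure called "foo"","y"\n'


class MalformedInput(ValueError):
    """Raised when more lines are malformed than the caller tolerates."""


def read_csv_lenient(text, max_bad_lines=10):
    """Parse CSV text strictly, skipping bad lines up to max_bad_lines.

    Hypothetical helper. Returns (header, rows, skipped), where skipped
    holds (line_number, reason) for every omitted line.
    """
    rows, skipped = [], []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            # strict=True makes the csv module raise csv.Error on bad
            # quoting instead of guessing what the author meant.
            (fields,) = csv.reader([line], strict=True)
        except csv.Error as exc:
            if len(skipped) >= max_bad_lines:
                raise MalformedInput(
                    f"mal-formed input at line {line_number}: {exc}"
                ) from exc
            skipped.append((line_number, str(exc)))
            continue
        rows.append(fields)
    if not rows:
        raise MalformedInput("mal-formed input: no parsable header line")
    return rows[0], rows[1:], skipped


if __name__ == "__main__":
    header, rows, skipped = read_csv_lenient(SAMPLE)
    print(header, rows, skipped)
```

With the sample file, the header and the first data line parse normally and the third line is recorded as skipped. Passing max_bad_lines=0 turns the same line into a hard error that names the line number.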