Abdel, select * on my csv file fails as well
Thanks

2016-02-01 17:16 GMT+01:00 Abdel Hakim Deneche <[email protected]>:

> When you run a select * on your csv file, does it succeed or fail ?
>
> On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <[email protected]> wrote:
>
> > @Abdel,
> >
> > Yes, the problem is similar. By the way, the jira issue already exists,
> > doesn't it ?
> >
> > https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> >
> > If not, I would be glad to add one. Just tell me why
> >
> > @Ted,
> >
> > If you have new lines in your files then the files become unsuitable for
> > splitting. This means that the only parallelism available in a ctas
> > statement is multiple files
> >
> > Does it mean newlines are incompatible with drill's distributed calculus ?
> >
> > Do you have a fair number of files?
> > I have one 30GB csv file. I don't know how many parquet files it could
> > create, as the process crashes because of newlines.
> > I can imagine approx 5 parquet files 500 MB.
> >
> > Thanks,
> >
> > 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <[email protected]>:
> >
> > > Another user already reported some problems querying csv files with new
> > > line characters:
> > >
> > > http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> > >
> > > His particular problem was related to a bug in the LIKE function.
> > > Unfortunately he never got around to filing a JIRA for his issue.
> > >
> > > Is your problem similar ? if yes, then can you please file a JIRA ?
> > >
> > > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <[email protected]> wrote:
> > >
> > > > Hello Abdel,
> > > >
> > > > I am creating parquet files from those CSV files (CREATE TABLE syntax).
> > > > Basically, I have a text column, with a maximum of 50k characters,
> > > > containing newlines (the texts are extracted from PDFs). I have
> > > > many millions of tuples of texts. I am subsetting texts containing some
> > > > patterns (LIKE '%foo%' or regex => sadly I haven't found any mention of
> > > > regex in the documentation (postgresql "~" operator equivalent))
> > > > Usually I use postgresql or monetdb in order to mine the texts, but I am
> > > > benchmarking/studying apache drill too.
> > > >
> > > > Thanks,
> > > >
> > > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <[email protected]>:
> > > >
> > > > > Hey Nicolas,
> > > > >
> > > > > what kind of queries are you running on your csv file ?
> > > > >
> > > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am trying to import a csv containing large texts. They contain the
> > > > > > newline character "\n".
> > > > > > Apache Drill complains about that. There is a jira issue opened on
> > > > > >
> > > > > > https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > > >
> > > > > > Is there a workaround ? (other than removing \n from texts)
> > > > > >
> > > > > > Thanks in advance
> > > > >
> > > > > --
> > > > > Abdelhakim Deneche
> > > > > Software Engineer
> > > > > <http://www.mapr.com/>
> > > > >
> > > > > Now Available - Free Hadoop On-Demand Training
> > > > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
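
To illustrate Ted's point in the thread above, that embedded newlines make a file unsuitable for splitting, here is a minimal Python sketch. It is not Drill code and says nothing about how Drill's text reader is implemented. The sample data and variable names are made up for illustration. It only shows that a splitter cutting on physical line breaks sees a different number of rows than a quote-aware CSV parser does.

```python
import csv
import io

# Made-up sample: a header plus two records. The first record has a
# quoted text field that contains an embedded newline.
data = 'id,text\n1,"first line\nsecond line"\n2,"single line"\n'

# Cutting on physical line breaks (what a naive file splitter does)
# breaks the first record in two.
physical_lines = data.splitlines()

# A quote-aware parser keeps the embedded newline inside the field.
records = list(csv.reader(io.StringIO(data)))

print(len(physical_lines))  # physical lines, header included
print(len(records))         # logical CSV rows, header included
print(repr(records[1][1]))  # the multi-line text field, intact
```

The two counts differ, so a reader that starts in the middle of the file cannot tell from a line break alone whether it is at a record boundary or inside a quoted field.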
