Maximilian suggested reading the field that may be empty as a String and treating the empty string as null. I tried that, but readCsvFile() still does not accept the empty value: it fails with a ParseException whose message is "Row too short".

// Field order follows includedFields: columns 3 and 5 are _item and _user.
case class WebClick(click_date: String, click_time: String, item: String, user: String)

private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
  env.readCsvFile[WebClick](
    webClickPath,
    fieldDelimiter = "|",
    includedFields = Array(0, 1, 3, 5)
    // lenient = true
  )
}

// e.g. 36890|26789|0|3725|20|85457
// e.g. _date|_click|_sales|_item|_web_page|_user

Caused by: org.apache.flink.api.common.io.ParseException: Row too short: 36890|4749||13183|29|
    at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:383)
    at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.readRecord(ScalaCsvInputFormat.java:214)
    at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
    at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.nextRecord(ScalaCsvInputFormat.java:182)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:745)

Is there any suggestion?

On Fri, Oct 23, 2015 at 7:18 PM, Shiti Saxena <ssaxena....@gmail.com> wrote:

> For a similar problem where we wanted to preserve and track null entries,
> we load the CSV as a DataSet[Array[Object]] and then transform it into
> DataSet[Row] using a custom RowSerializer(
> https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.
>
> The Table API(which supports null) can then be used on the resulting
> DataSet[Row].
>
>
> On Fri, Oct 23, 2015 at 7:38 PM, Maximilian Michels <m...@apache.org>
> wrote:
>
>> Hi Philip,
>>
>> How about making the empty field of type String? Then you can read the
>> CSV into a DataSet and treat the empty string as a null value. Not very
>> nice but a workaround. As of now, Flink deliberately doesn't support null
>> values.
>>
>> Regards,
>> Max
>>
>>
>> On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee <philjj...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to load the dataset with the part of null value by using
>>> readCsvFile().
>>>
>>> // e.g _date|_click|_sales|_item|_web_page|_user
>>>
>>> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int,
>>> _item: Int,_page: Int, _user: Int)
>>>
>>> private def getWebClickDataSet(env: ExecutionEnvironment):
>>> DataSet[WebClick] = {
>>>
>>> env.readCsvFile[WebClick](
>>> webClickPath,
>>> fieldDelimiter = "|",
>>> includedFields = Array(0, 1, 2, 3, 4, 5),
>>> // lenient = true
>>> )
>>> }
>>>
>>>
>>> Well, I know there is an option to ignore malformed value, but I have to
>>> read the dataset even though it has null value.
>>>
>>> as it follows, dataset (third column is null) looks like
>>> 37794|24669||16705|23|54810
>>> but I have to read null value as well because I have to use filter or
>>> where function ( _sales == null )
>>>
>>> Is there any detail suggestion to do it?
>>>
>>> Thanks,
>>> Philip
>>>
>>> --
>>> ==========================================================
>>> *Hae Joon Lee*
>>>
>>> Now, in Germany,
>>> M.S. Candidate, Interested in Distributed System, Iterative Processing
>>> Dept. of Computer Science, Informatik in German, TUB
>>> Technical University of Berlin
>>>
>>> In Korea,
>>> M.S. Candidate, Computer Architecture Laboratory
>>> Dept. of Computer Science, KAIST
>>>
>>> Rm# 4414 CS Dept. KAIST
>>> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>>>
>>> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>>> ==========================================================
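
One possible fallback, in case the String-typed readCsvFile() keeps rejecting such rows, is to split each line by hand. Note that the failing row above ends with the delimiter, so the last included field (index 5) is empty; whether that trailing empty field is what triggers "Row too short" is only a guess. The sketch below is plain Scala with no Flink dependency. WebClickParser, parse and opt are hypothetical names invented for illustration, and the Option-typed WebClick is an assumed variant of the case class, not the one from the thread.

```scala
// Hypothetical variant of WebClick: a missing value is None instead of null.
case class WebClick(clickDate: Option[Long], clickTime: Option[Long],
                    item: Option[Int], user: Option[Int])

// Hypothetical helper, not part of Flink: parses one pipe-delimited line.
object WebClickParser {
  // Empty or whitespace-only text is treated as a missing value.
  private def opt(s: String): Option[String] =
    Option(s).map(_.trim).filter(_.nonEmpty)

  def parse(line: String): WebClick = {
    // limit = -1 keeps trailing empty fields, so a row ending in "|" still
    // yields six columns; padTo guards against rows with fewer columns.
    val f = line.split("\\|", -1).padTo(6, "")
    // Columns: _date|_click|_sales|_item|_web_page|_user (0, 1, 3, 5 are used).
    WebClick(opt(f(0)).map(_.toLong), opt(f(1)).map(_.toLong),
             opt(f(3)).map(_.toInt), opt(f(5)).map(_.toInt))
  }
}
```

With Flink this could presumably be wired in as env.readTextFile(webClickPath).map(WebClickParser.parse _), followed by something like filter(_.user.isEmpty) in place of a comparison with null. That wiring is untested here and assumes the Scala API's implicit type information handles Option fields in a case class.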