[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-599304956 @nsivabalan Done squashing the commits. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-599304487 I plan to write one after this is merged.
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-597457998

> One question about using nested schema. Can you remind me what happens if someone passes in a nested schema for CsvDeltaStreamer?

I used the code below to test a nested schema with the CSV reader in Spark. It throws the following exception, which shows that the Spark CSV source does not currently support nested schemas. In most cases, CSV schemas should be flattened. Whether a nested schema is supported for the CSV source depends on Spark's behavior (Spark may add nested-schema support for CSV in the future), so we don't enforce the check in our Hudi code.

```
org.apache.spark.sql.AnalysisException: CSV data source does not support struct data type.;
    at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:69)
    at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:67)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
    at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:67)
    at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyReadSchema(DataSourceUtils.scala:41)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:400)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
    at org.apache.hudi.utilities.sources.CsvDFSSource.fromFiles(CsvDFSSource.java:120)
    at org.apache.hudi.utilities.sources.CsvDFSSource.fetchNextBatch(CsvDFSSource.java:93)
    at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
    at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
    at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:66)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:317)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
    at org.apache.hudi.utilities.TestHoodieDeltaStreamer.testCsvDFSSourceWithNestedSchema(TestHoodieDeltaStreamer.java:812)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
```
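
Since the comment notes that CSV schemas should usually be flattened, here is a minimal sketch of what flattening could look like before a schema is handed to a CSV reader. It is an illustration only, not code from this PR: the class `FlatSchemaSketch`, its `flatten` method, the `Map`-based schema model, the `_` separator, and the sample field names are all hypothetical, and it uses neither the Hudi nor the Spark API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Hudi or Spark. */
public class FlatSchemaSketch {

    /**
     * Flattens a nested schema into top-level columns.
     * A schema maps a field name either to a type name (String)
     * or to another Map describing a nested struct.
     * Nested field names are joined to their parent with '_'.
     */
    public static Map<String, String> flatten(Map<String, Object> schema) {
        Map<String, String> flat = new LinkedHashMap<>();
        flattenInto("", schema, flat);
        return flat;
    }

    private static void flattenInto(String prefix, Map<String, Object> schema,
                                    Map<String, String> out) {
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            String name = prefix.isEmpty() ? field.getKey() : prefix + "_" + field.getKey();
            Object type = field.getValue();
            if (type instanceof Map) {
                // Struct field: recurse so only leaf fields become columns.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) type;
                flattenInto(name, nested, out);
            } else {
                // Leaf field: keep its type name as-is.
                out.put(name, String.valueOf(type));
            }
        }
    }

    public static void main(String[] args) {
        // Sample nested schema (field names are made up for this sketch).
        Map<String, Object> fare = new LinkedHashMap<>();
        fare.put("amount", "double");
        fare.put("currency", "string");

        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("timestamp", "long");
        schema.put("rider", "string");
        schema.put("fare", fare);

        System.out.println(flatten(schema));
    }
}
```

Running `main` prints `{timestamp=long, rider=string, fare_amount=double, fare_currency=string}`: the struct field `fare` is replaced by two flat columns, the shape a CSV file can represent.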
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-597292426 Sorry for the delay. I'll get to this PR this week.
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-578554164 @bvaradar I added more javadoc and checked that Spark CSV supports timestamp-type fields.
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-576069548 @bvaradar @leesf Could any of you review this PR by EOD?
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-576069136 @vinothchandar From my side, the code change is ready. I'm not sure if it can be reviewed and merged in time. I'm fine with pushing this to v0.6.0.
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575977167 `TestCsvDFSSource` will be added once #1239 is merged.
[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575976539 @bvaradar This PR is ready for review. @leesf @vinothchandar Feel free to also review this PR. I'm not sure if we can merge this PR by the release cut. If not, we can add this feature to the next release. Thanks @UZi5136225 for helping test the functionality of this PR and reporting the [issue](https://issues.apache.org/jira/browse/HUDI-552) of corrupt data generated from DeltaStreamer with text files (CSV format with no header line). That issue has been fixed in another [PR](https://github.com/apache/incubator-hudi/pull/1246).