[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-15 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-599304956
 
 
   @nsivabalan Done squashing the commits.  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-15 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-599304487
 
 
   I plan to write one after this is merged.




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-10 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-597457998
 
 
   > One question about using nested schema. Can you remind me what happens if 
someone passes in a nested schema for CsvDeltaStreamer?
   
   I used the code below to test a nested schema with the CSV reader in 
Spark.  It throws the following exception, which means that the Spark CSV 
source does not currently support nested schemas.
   
   In most cases, CSV schemas should be flat.  Whether a nested schema is 
supported for the CSV source depends on Spark's behavior (nested schemas may 
be supported for CSV in the future), so we don't enforce a check in our Hudi 
code.
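
As an illustration of what a "flattened" schema means here, the following is a minimal sketch in plain Python (standard library only). It is not the test code referred to above and not part of Hudi or Spark. The helper names `is_flat` and `flatten_fields`, the `_` separator, and the sample `trip` schema are all hypothetical. It assumes an Avro-style schema given as a dict and only looks for nested records, which is the struct case named in the exception.

```python
from typing import Any, Dict, List, Optional, Tuple

Schema = Dict[str, Any]


def _record_of(field_type: Any) -> Optional[Schema]:
    """Return the nested record schema inside a field type, or None."""
    if isinstance(field_type, dict) and field_type.get("type") == "record":
        return field_type
    if isinstance(field_type, list):  # Avro union, e.g. ["null", {...record...}]
        for branch in field_type:
            record = _record_of(branch)
            if record is not None:
                return record
    return None


def is_flat(schema: Schema) -> bool:
    """True if no top-level field of the record is itself a record."""
    return all(_record_of(f["type"]) is None for f in schema["fields"])


def flatten_fields(schema: Schema, prefix: str = "", sep: str = "_") -> List[Tuple[str, Any]]:
    """Expand nested records into flat (column_name, type) pairs."""
    flat: List[Tuple[str, Any]] = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        record = _record_of(field["type"])
        if record is None:
            flat.append((name, field["type"]))
        else:
            # Hypothetical naming convention: parent and child joined by `sep`.
            flat.extend(flatten_fields(record, prefix=name + sep, sep=sep))
    return flat


# Hypothetical sample schema with one nested record field.
NESTED_SCHEMA: Schema = {
    "type": "record",
    "name": "trip",
    "fields": [
        {"name": "id", "type": "string"},
        {
            "name": "fare",
            "type": {
                "type": "record",
                "name": "fare",
                "fields": [
                    {"name": "amount", "type": "double"},
                    {"name": "currency", "type": "string"},
                ],
            },
        },
    ],
}

if __name__ == "__main__":
    print(is_flat(NESTED_SCHEMA))
    for column, column_type in flatten_fields(NESTED_SCHEMA):
        print(column, column_type)
```

A pre-check like `is_flat` would run on the user side before starting the job. Consistent with the comment above, nothing of the sort is enforced in Hudi itself.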
   
   ```
   org.apache.spark.sql.AnalysisException: CSV data source does not support struct data type.;
       at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:69)
       at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:67)
       at scala.collection.Iterator$class.foreach(Iterator.scala:891)
       at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
       at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
       at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
       at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:67)
       at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyReadSchema(DataSourceUtils.scala:41)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:400)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
       at org.apache.hudi.utilities.sources.CsvDFSSource.fromFiles(CsvDFSSource.java:120)
       at org.apache.hudi.utilities.sources.CsvDFSSource.fetchNextBatch(CsvDFSSource.java:93)
       at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
       at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
       at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:66)
       at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:317)
       at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
       at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
       at org.apache.hudi.utilities.TestHoodieDeltaStreamer.testCsvDFSSourceWithNestedSchema(TestHoodieDeltaStreamer.java:812)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
       at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
       at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
       at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
       at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
       at
   ```

[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-10 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-597292426
 
 
   Sorry for the delay.  I'll get to this PR this week.




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-26 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-578554164
 
 
   @bvaradar I added more javadoc and checked that Spark CSV supports 
timestamp-type fields.




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-19 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-576069548
 
 
   @bvaradar @leesf Could any of you review this PR by EOD?




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-19 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-576069136
 
 
   @vinothchandar From my side, the code change is ready.  I'm not sure if it 
can be reviewed and merged in time.  I'm fine with pushing this to v0.6.0.




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-18 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575977167
 
 
   `TestCsvDFSSource` will be added once #1239 is merged.




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-18 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575976539
 
 
   @bvaradar This PR is ready for review.
   
   @leesf @vinothchandar Feel free to also review this PR.  I'm not sure if we 
can merge this PR by the release cut.  If not, we can add this feature to the 
next release.
   
   Thanks @UZi5136225 for helping test the functionality of this PR and 
reporting the [issue](https://issues.apache.org/jira/browse/HUDI-552) of 
corrupt data generated from DeltaStreamer with text files (CSV format with no 
header line).  The latter has been fixed in another 
[PR](https://github.com/apache/incubator-hudi/pull/1246).

