[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-09-04 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-687479902


   @luffyd Thanks for reporting. I created a ticket to track this: 
https://issues.apache.org/jira/browse/HUDI-1270



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-08-06 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-670078060


   > ```
   > [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 66.614 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
   > [ERROR] org.apache.hudi.functional.TestCOWDataSource.testStructuredStreaming  Time elapsed: 25.766 s  <<< ERROR!
   > java.util.concurrent.ExecutionException: Boxed Error
   > Caused by: org.opentest4j.AssertionFailedError: expected: <2> but was: <3>
   > ```
   > 
   > @garyli1019 this is failing now. See my last fix: just unsetting the path 
filter after resolving the relation seems to get past the issue. So the root 
issue there was the path filter still kicking in for 
`spark.read.format('parquet')` (which is arguably even expected).
   
   @vinothchandar Agreed. I see the InMemoryFileIndex warning on every test 
that follows the SparkStreaming test, but not when I run a COW test alone. I 
will try changing the test setup to clean up the Spark context properly.







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-08-03 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-668398894


   > To be clear, you are saying it should all be working correctly? Assuming 
you don't have conflicts with #1807, can you please rebase this off the latest 
master?
   
   Yes, the custom payload is working. Done rebasing.







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-08-03 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-668363766


   > small summary on what the follow-up work here is?
   
   @vinothchandar For the 0.6.0 release, the only item left is incremental 
pulling. I am currently working on it and will probably get it done by 
tomorrow. Maybe we can wait until tomorrow to review the whole thing.
   The vectorized reader and pruning are supported in the current version.
   
   > custom payload support?
   
   Yes. `HoodieMergedLogRecordScanner` handles the payload loading, so the 
user only needs to specify `merge` or `skipmerge` when running the query. 
I included a unit test for deletes with `OverwriteWithLatestAvroPayload` to 
verify the custom payload mechanism.
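
To make the difference between the two modes concrete, here is a conceptual sketch in plain Python. It is not Hudi's implementation: `read_file_split`, the `key` field, and the use of `None` to mark a delete are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
# (record key, new row); a row of None marks a delete. Hypothetical encoding.
LogEntry = Tuple[str, Optional[Row]]


def read_file_split(base_rows: List[Row], log_entries: List[LogEntry],
                    merge: bool, key_field: str = "key") -> List[Row]:
    """Toy model of reading one base file together with its log entries."""
    if not merge:
        # Skip-merge: base rows plus every non-delete log row, duplicates kept.
        return base_rows + [row for _, row in log_entries if row is not None]

    # Merge: the latest log entry per key wins over the base row.
    latest: Dict[str, Optional[Row]] = {}
    for key, row in log_entries:
        latest[key] = row

    merged: List[Row] = []
    for row in base_rows:
        key = str(row[key_field])
        if key in latest:
            replacement = latest.pop(key)
            if replacement is not None:  # None means the record was deleted
                merged.append(replacement)
        else:
            merged.append(row)
    # Keys that only appear in the log are new inserts.
    merged.extend(row for row in latest.values() if row is not None)
    return merged
```

In this model a delete only takes effect in merge mode, which is why a delete test is a reasonable check that the payload is actually being applied.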
   







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-31 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-667466538


   Tested on a 100GB MOR table. A few partitions have upsert log files that 
are 100% duplicates; the others have parquet files only.
   For partitions with parquet files only, the `SNAPSHOT` query is as 
efficient as the `READ_OPTIMIZED` query. File splits with log files are 
expensive, but that is expected.
   For one 50MB parquet file, the log file was ~1GB. Each file split is loaded 
as one task.
   Count performance for 50MB parquet + 1GB log:
   merge: 40s
   unmerge: 40s
   Show performance: because data source V1 doesn't support `limit()`, it 
just scans the whole file.
   without column pruning: df_mor.show(10) took 40s
   with column pruning: df_mor.select("_hoodie_commit_time").show(10) took 27s
   @vinothchandar @umehrot2 @bvaradar 
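
The `show(10)` numbers above can be read through a toy model: Spark's V1 `PrunedFilteredScan.buildScan` receives the required columns and filters but no limit, so the scan reads the whole split and the limit is applied afterwards. The sketch below is plain Python with hypothetical names (`FileSplitScan`, `show`), not Spark or Hudi code, and it counts cells read rather than measuring time.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


class FileSplitScan:
    """Toy relation that counts how many cells it materializes."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.cells_read = 0

    def build_scan(self, required_columns: Sequence[str]) -> List[Row]:
        # No limit parameter: every row of the split is read,
        # but only the requested columns are materialized.
        out: List[Row] = []
        for row in self.rows:
            out.append({c: row[c] for c in required_columns})
            self.cells_read += len(required_columns)
        return out


def show(scan: FileSplitScan, columns: Sequence[str], n: int) -> List[Row]:
    """The limit is applied only after the full scan has been produced."""
    return scan.build_scan(columns)[:n]
```

Under this model, column pruning lowers the per-row cost while the row count stays the same, which is consistent with the pruned query being faster but still far from instant.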







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-30 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-666091024


   @bvaradar I tested on the Spark 2.4.0 CDH release with a small dataset and 
found a broadcast configuration issue. I pushed a new commit with the fix. Now 
this works fine on my cluster. I will test a larger dataset tomorrow.
   I couldn't reproduce `java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile`. Is this class 
somehow missing in the AWS release? Are you able to import it from the 
spark-shell?







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-29 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-665798153


   @bvaradar Thanks for trying this out. `java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile` looks strange. I 
will try it out in my production environment today to see if I can reproduce it.







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-26 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-664055034


   Added support for `PrunedFilteredScan`. Please review this PR again. Thank you!







[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-20 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-661642358


   @vinothchandar @umehrot2 Ready for review. Thanks!


