[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-06-10 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132587#comment-17132587
 ] 

Bhavani Sudha commented on HUDI-69:
---

Sorry, accidentally assigned it to myself. Reverted the change.

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099507#comment-17099507
 ] 

Yanjia Gary Li commented on HUDI-69:


Can anyone reopen this ticket? I accidentally closed this :)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084562#comment-17084562
 ] 

Vinoth Chandar commented on HUDI-69:


Switching to V2 for writing is tricky. The V2 datasource API takes a lot of 
control away: for example, we would need to define a bunch of things at the 
individual task level, rather than shuffling, indexing, etc. as we do today. 
This needs more thought IMO.

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084560#comment-17084560
 ] 

Vinoth Chandar commented on HUDI-69:


I was wondering if we can just wrap the FileFormat (Parquet and ORC both have 
formats inside Spark), reuse its record reader for reading parquet/orc -> Row, 
and also use our existing LogReader classes to read the log blocks as Row 
(instead of GenericRecord; or, for now, we can do GenericRecord -> Row). This 
means we need to redesign our CompactedRecordScanner etc. classes to be generic 
and not implicitly assume they are merging Avro/ArrayWritable per se. Must be doable.
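
A minimal sketch of what "generic" could mean for the merge step, assuming a simplified rule where the log record replaces the base record with the same key. The class name GenericMergeSketch, the Row type, and the merge signature are hypothetical and are not Hudi or Spark APIs; the real scanner classes also deal with payload combining and deletes, which are left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a merge step that is generic over the record type R,
// so it does not assume Avro GenericRecord or ArrayWritable.
final class GenericMergeSketch {

    // Stand-in record type for illustration only (not Spark's Row).
    static final class Row {
        final String key;
        final String value;

        Row(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Merges base-file records with log records. A log record replaces the
    // base record with the same key; log records with new keys are appended.
    // Assumption: "log record wins" is a simplification of the real merge rule.
    static <K, R> List<R> merge(List<R> baseRecords,
                                List<R> logRecords,
                                Function<R, K> keyOf) {
        Map<K, R> merged = new LinkedHashMap<>();
        for (R record : baseRecords) {
            merged.put(keyOf.apply(record), record);
        }
        for (R record : logRecords) {
            merged.put(keyOf.apply(record), record);
        }
        return new ArrayList<>(merged.values());
    }
}
```

Because the key extractor is passed in as a function, the same merge could in principle be reused for Row, GenericRecord, or ArrayWritable inputs by supplying a different extractor.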

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-13 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082773#comment-17082773
 ] 

Yanjia Gary Li commented on HUDI-69:


After a closer look, I think Spark datasource support for realtime tables needs:
 * Refactoring HoodieRealtimeFormat and its file split and record reader, to 
decouple Hudi logic from the MapredParquetInputFormat. I think we can maintain 
the Hudi file split and path filtering in a central place, where different 
query engines can adopt them. With bootstrap support, the file format 
maintenance could be more complicated, so I think this is essential. 
 * Implementing an extension of ParquetInputFormat from Spark, or a custom data 
source reader, to handle the merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Using Datasource V2 as the default data source. 

Please let me know what you guys think. 

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-07 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077314#comment-17077314
 ] 

Bhavani Sudha commented on HUDI-69:
---

[~garyli1019] Yes, the InputPathHandler will be able to provide MOR snapshot 
paths. However, I think the FileInputFormat filters out hidden files by default. 
The log files start with a `.` and hence are treated as hidden files by the 
FileInputFormat class. Given this context, when we call super.listStatus from 
HoodieParquetInputFormat - 
[https://github.com/apache/incubator-hudi/blob/b5d093a21bbb19f164fbc549277188f2151232a8/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java#L107]
 the log files are not listed.
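
To make the hidden-file behaviour concrete, here is a standalone sketch that mimics the naming convention (names starting with `.` or `_` are skipped). It does not call Hadoop; the class HiddenFileFilterSketch and the file names are made up for illustration, and the `_` prefix rule is my understanding of the default filter rather than something stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: imitates a hidden-file filter by name prefix.
// This is not Hadoop's or Hudi's code.
final class HiddenFileFilterSketch {

    // Returns true when a file name would survive the hidden-file filter.
    static boolean isVisible(String fileName) {
        return !fileName.startsWith(".") && !fileName.startsWith("_");
    }

    // Keeps only the visible names, preserving the input order.
    static List<String> listVisible(List<String> fileNames) {
        List<String> visible = new ArrayList<>();
        for (String name : fileNames) {
            if (isVisible(name)) {
                visible.add(name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Hypothetical file names: one base file, one log file, one marker file.
        List<String> files = Arrays.asList("f1_001.parquet", ".f1_001.log.1", "_SUCCESS");
        System.out.println(listVisible(files)); // prints [f1_001.parquet]
    }
}
```

Under this filter the log file never reaches the caller, which is why a listing that should include log files would need a different path filter or a separate listing step.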

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-05 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076023#comment-17076023
 ] 

Yanjia Gary Li commented on HUDI-69:


Hello [~bhasudha], I found your commit 
[https://github.com/apache/incubator-hudi/commit/d09eacdc13b9f19f69a317c8d08bda69a43678bc]
 could be related to this ticket.

Is InputPathHandler able to provide MOR snapshot paths (avro + parquet)? If 
not, I could probably start from the path selector. 

To add Spark Datasource support for RealtimeUnmergedRecordReader, we may simply 
use the Spark SQL API to read the two separate formats and then union them 
together. Does that make sense? 

To merge them, I might need to dig deeper. 
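
A rough sketch of the unmerged idea in plain Java, without Spark: records from the base files and from the log files are simply concatenated, so a key that was updated in the log shows up twice. The class UnmergedUnionSketch and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: "unmerged" read as a plain union of two record sources.
final class UnmergedUnionSketch {

    // Concatenates base-file records and log records without deduplication.
    static <R> List<R> union(List<R> baseRecords, List<R> logRecords) {
        List<R> all = new ArrayList<>(baseRecords);
        all.addAll(logRecords);
        return all;
    }

    public static void main(String[] args) {
        // Made-up records in "key=value" form; k2 exists in both sources.
        List<String> base = Arrays.asList("k1=a", "k2=b");
        List<String> log = Arrays.asList("k2=B");
        System.out.println(union(base, log)); // prints [k1=a, k2=b, k2=B]
    }
}
```

Both versions of k2 remain in the result, which is the part a merged reader would have to resolve.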

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-01 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072488#comment-17072488
 ] 

Vinoth Chandar commented on HUDI-69:


[~garyli1019] Great.. Assigned to you!

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-03-31 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072382#comment-17072382
 ] 

Yanjia Gary Li commented on HUDI-69:


[~vinoth] I am happy to work on this ticket. Please assign it to me.

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136


