Re:Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-06 Thread 孔维
Hi, vinoth,


I created a PR(https://github.com/apache/hudi/pull/8376) for this feature, 
could you help review it?




BR,
Kong








At 2023-04-05 00:19:20, "Vinoth Chandar"  wrote:
>Look forward to this! could really help backfill/rebootstrap scenarios.
>
>On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:
>
>> Thinking out loud.
>>
>> 1. For insert operations, it should not matter anyway.
>> 2. For upsert etc, the preCombine would handle the ordering problems.
>>
>> Is that what you are saying? I feel we don't want to leak any Kafka
>> specific logic or force use of special payloads etc. thoughts?
>>
>> I assigned the jira to you and also made you a contributor. So in future,
>> you can self-assign.
>>
>> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> Yea, we can create multiple spark input partitions per Kafka partition.
>>>
>>>
>>> I think the write operations can handle the potentially out-of-order
>>> events, because before writing we need to preCombine the incoming events
>>> using source-ordering-field and we also need to combineAndGetUpdateValue
>>> with records on storage. From a business perspective, we use the combine
>>> logic to keep our data correct. And hudi does not require any guarantees
>>> about the ordering of kafka events.
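To illustrate the point (a rough sketch only, not Hudi's actual payload classes): combining two versions of a key by the source-ordering field is symmetric, so it does not matter which spark input partition read an event first.

case class Event(key: String, orderingField: Long, value: String)

// Keep the version with the larger ordering field (ties keep the left side).
def preCombine(a: Event, b: Event): Event =
  if (a.orderingField >= b.orderingField) a else b

// preCombine(e1, e2) and preCombine(e2, e1) agree whenever the ordering fields
// differ, which is why out-of-order reads across splits are tolerated.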
>>>
>>>
>>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
>>> could you help assign the JIRA to me?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
>>> >Hi,
>>> >
>>> >Does your implementation read out offset ranges from Kafka partitions?
>>> >which means - we can create multiple spark input partitions per Kafka
>>> >partitions?
>>> >if so, +1 for overall goals here.
>>> >
>>> >How does this affect ordering? Can you think about how/if Hudi write
>>> >operations can handle potentially out-of-order events being read out?
>>> >It feels like we can add a JIRA for this anyway.
>>> >
>>> >
>>> >
>>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
>>> >
>>> >> Hi team, for the kafka source, when pulling data from kafka, the
>>> default
>>> >> parallelism is the number of kafka partitions.
>>> >> There are cases:
>>> >>
>>> >> 1. Pulling a large amount of data from kafka (eg. maxEvents=1), but the
>>> >> number of kafka partitions is not enough: the pull will cost too much
>>> >> time, or even cause executor OOMs.
>>> >> 2. There is heavy data skew between kafka partitions: the pull will be
>>> >> blocked by the slowest partition.
>>> >>
>>> >> To solve those cases, I want to add a parameter,
>>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents, to control the maxEvents in
>>> >> one kafka batch; the default value Long.MAX_VALUE means the feature is not
>>> >> turned on. The hoodie.deltastreamer.kafka.per.batch.maxEvents configuration
>>> >> takes effect after the hoodie.deltastreamer.kafka.source.maxEvents config.
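Conceptually, the per-batch cap lets one kafka partition map to several spark input partitions. A minimal sketch of such splitting (an illustrative assumption, not the code in the PR):

case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// Split one kafka partition's offset range into sub-ranges of at most
// maxEventsPerBatch events; each sub-range can become its own spark input partition.
def split(range: OffsetRange, maxEventsPerBatch: Long): Seq[OffsetRange] = {
  if (range.untilOffset - range.fromOffset <= maxEventsPerBatch) {
    Seq(range)
  } else {
    (range.fromOffset until range.untilOffset by maxEventsPerBatch).map { start =>
      OffsetRange(range.topic, range.partition, start,
        math.min(start + maxEventsPerBatch, range.untilOffset))
    }
  }
}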
>>> >>
>>> >>
>>> >> Here is my POC of the improvement:
>>> >> max executor core is 128.
>>> >> not turn the feature on
>>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>>> >>
>>> >>
>>> >> turn on the feature
>>> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>>> >>
>>> >>
>>> >> After turning the feature on, the Tagging time dropped from 4.4 mins to
>>> >> 1.1 mins, and it can be even faster given more cores.
>>> >>
>>> >> What do you think? Can I file a jira issue for this?
>>>
>>


Re:Re: Re: DISCUSS

2023-03-24 Thread 吕虎
Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.

At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
>>but when it is used for data expansion, it still involves the need to
>redistribute the data records of some data files, thus affecting the
>performance.
>but expansion of the consistent hash index is an optional operation right?

>Sorry, not still fully understanding the differences here,
I'm sorry I didn't make myself clear. The expansion I mentioned last time 
refers to the growth of data records in a hudi table.
The difference between the consistent hash index and hash partitioning with a 
Bloom filter index is how they deal with data growth:
For the consistent hash index, files are split. Splitting files affects 
performance, but it keeps working indefinitely. So the consistent hash index is 
suitable for scenarios where data growth cannot be estimated or where the data 
will grow very large.
For hash partitions with Bloom filter indexes, new files are created instead. 
Adding new files does not affect performance, but if there are too many files, 
the probability of false positives in the Bloom filters will increase. So hash 
partitioning with Bloom filter indexes is suitable for scenarios where data 
growth can be estimated within a relatively small range.
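As a small illustration of the splitting behaviour described above (my own sketch, not Hudi's implementation): with consistent hashing, splitting one bucket only remaps the keys that fall into that bucket's range, and the rest of the table is untouched.

import scala.collection.immutable.TreeMap

// Buckets are points on a hash ring; a key belongs to the first bucket at or
// after its (non-negative) hash value, wrapping around to the lowest bucket.
final case class HashRing(buckets: TreeMap[Int, String]) {
  def bucketFor(key: String): String = {
    val h = key.hashCode & Int.MaxValue
    val it = buckets.iteratorFrom(h)
    if (it.hasNext) it.next()._2 else buckets.head._2
  }
  // Splitting an overloaded bucket adds one point to the ring; only the keys
  // hashing into the split range move, the other buckets are unaffected.
  def split(at: Int, newBucket: String): HashRing = HashRing(buckets + (at -> newBucket))
}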


>>Because the hash partition field values under the parquet file in a
>columnar storage format are all equal, the added column field hardly
>occupies storage space after compression.
>Any new meta field added adds other overhead in terms evolving the schema,
>so forth. are you suggesting this is not possible to do without a new meta
>field?

An implementation without a new meta field would be more elegant, but for me, 
not yet being familiar with the Hudi source code, it is somewhat difficult to 
implement; it is not a problem for experts, though. If we implement it without 
adding new meta fields, I hope I can participate in some of the simpler 
development and also learn how the experts do it.


>On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
>
>> Hello,
>>  I feel very honored that you are interested in my views.
>>
>>  Here are some of my thoughts marked with blue font.
>>
>> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>>
>> >Thanks for the proposal! Some first set of questions here.
>> >
>> >>You need to pre-select the number of buckets and use the hash function to
>> >determine which bucket a record belongs to.
>> >>when building the table according to the estimated amount of data, and it
>> >cannot be changed after building the table
>> >>When the amount of data in a hash partition is too large, the data in
>> that
>> >partition will be split into multiple files in the way of Bloom index.
>> >
>> >All these issues are related to bucket sizing could be alleviated by the
>> >consistent hashing index in 0.13? Have you checked it out? Love to hear
>> >your thoughts on this.
>>
>> Hash partitioning is applicable to data tables that cannot give the exact
>> capacity of data, but can estimate a rough range. For example, if a company
>> currently has 300 million customers in the United States, the company will
>> have 7 billion customers in the world at most. In this scenario, using hash
>> partitioning to cope with data growth within the known range by directly
>> adding files and establishing  bloom filters can still guarantee
>> performance.
>> The consistent hash bucket index is also very valuable, but when it is
>> used for data expansion, it still involves the need to redistribute the
>> data records of some data files, thus affecting the performance. When it is
>> completely impossible to estimate the range of data capacity, it is very
>> suitable to use consistent hashing.
>> >> you can directly search the data under the partition, which greatly
>> >reduces the scope of the Bloom filter to search for files and reduces the
>> >false positive of the Bloom filter.
>> >the bloom index is already partition aware and unless you use the global
>> >version can achieve this. Am I missing something?
>> >
>> >Another key thing is - if we can avoid adding a new meta field, that would
>> >be great. Is it possible to implement this similar to bucket index, based
>> >on jsut table properties?
>> Add a hash partition field in the table to implement the hash partition
>> function, which can well reuse the existing partition function, and
>> involves very few code changes. Because the hash partition field values
>> under the parquet file in a columnar storage format are all equal, the
>> added column field hardly occupies storage space after compression.
>> Of course, it is not necessary to add hash partition fields in the table,
>> but to store hash partition fields in the corresponding metadata to achieve
>> this function, but it will be difficult to reuse the existing functions.
>> The establishment of hash partition and partition pruning during query need
>> more time to develop code and test again.
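Reusing the existing partition path mechanism could look roughly like the sketch below; the dataframe df, the key field "uuid", the base path and the bucket count are illustrative assumptions, not part of the proposal.

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Derive a hash partition column from the record key and use it as the Hudi
// partition path; every row in a given parquet file then carries the same
// value, so the extra column compresses to almost nothing.
val numHashPartitions = 64
val withHashPartition = df.withColumn(
  "hash_partition", pmod(hash(col("uuid")), lit(numHashPartitions)))

withHashPartition.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "hash_partition").
  option("hoodie.index.type", "BLOOM").
  option("hoodie.table.name", "hash_partitioned_table").
  mode("append").
  save(basePath)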
>> >

Re:Re: Re: [DISCUSS] Improve the merge performance for cow

2020-02-28 Thread lamberken


Hi vinoth,


Thanks for reviewing the initial design :)
I know there are many problems at present (e.g. shuffling, parallelism issues). We 
can discuss the practicability of the idea first.


> ExternalSpillableMap itself was not the issue right, the serialization was
Right, the new design will not have this issue, because it will not use the map at all.


> This map is also used on the query side
Right, the proposal aims to improve the merge performance of the cow table.


> HoodieWriteClient.java#L546 We cannot collect() the recordRDD at all ... OOM 
> driver
Here, in order to get the Map, distinct() is executed before collect(), so the 
result is very small.
Also, it can be implemented in FileSystemViewManager, and lazy loading is also fine.


> Doesn't this move the problem to tuning spark simply?
there are two serious performance problems in the old merge logic.
1, when upsert many records, it will serialize record to disk, then deserialize 
it when merge old record
2, only single thread comsume the old record one by one, then handle the merge 
process, it is much less efficient.   


> doing a sort based merge repartitionAndSortWithinPartitions
Trying to understand your point :) 


Compared to the old version, there may be several improvements:
1. It uses spark built-in operators, so it's easier to understand.
2. During my testing, the upsert performance doubled.
3. If possible, we can write data in batches by using DataFrame in the future.
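As a rough sketch of the idea (for illustration only, not the code in the linked branch), the merge can be expressed with plain spark operators, keyed by record key and combined with preCombine-style semantics:

import org.apache.spark.rdd.RDD

case class Rec(key: String, ts: Long, payload: String)

// Union the old records of the affected files with the incoming upserts and
// let reduceByKey pick the winner per key, instead of buffering the old
// records in an ExternalSpillableMap on each executor.
def merge(oldRecords: RDD[Rec], incoming: RDD[Rec]): RDD[Rec] = {
  oldRecords.union(incoming)
    .map(r => (r.key, r))
    .reduceByKey((a, b) => if (a.ts >= b.ts) a else b)
    .values
}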


[1] 
https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java


Best,
Lamber-Ken









At 2020-02-29 01:40:36, "Vinoth Chandar"  wrote:
>Doesn't this move the problem to tuning spark simply? the
>ExternalSpillableMap itself was not the issue right, the serialization
>was.  This map is also used on the query side btw, where we need something
>like that.
>
>I took a pass at the code. I think we are shuffling data again for the
>reduceByKey step in this approach? For MOR, note that this is unnecessary
>since we simply log the records and there is no merge. This approach might
>have a better parallelism of merging when that's costly.. But ultimately,
>our write parallelism is limited by number of affected files right?  So its
>not clear to me, that this would be a win always..
>
>On the code itself,
>https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L546
> We cannot collect() the recordRDD at all.. It will OOM the driver .. :)
>
>Orthogonally, one thing we think of is : doing a sort based merge.. i.e
>repartitionAndSortWithinPartitions()  the input records to mergehandle, and
>if the file is also sorted on disk (its not today), then we can do a
>merge_sort like algorithm to perform the merge.. We can probably write code
>to bear one time sorting costs... This will eliminate the need for memory
>for merging altogether..
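A hedged sketch of that sort-based idea (assumptions only, not an implementation): partition the incoming records by their target file id and sort by record key within each partition, so each merge handle can stream a key-sorted file and do a merge-sort style pass without holding records in memory.

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

case class MergeKey(fileId: String, recordKey: String)

// Route every record to the spark partition that owns its target file.
class FileIdPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    Math.floorMod(key.asInstanceOf[MergeKey].fileId.hashCode, parts)
}

// Sort by (fileId, recordKey) inside each partition during the shuffle, so the
// merge can be a streaming merge-sort against a key-sorted base file.
implicit val mergeKeyOrdering: Ordering[MergeKey] =
  Ordering.by((k: MergeKey) => (k.fileId, k.recordKey))

def sortForMerge(records: RDD[(MergeKey, Array[Byte])], parts: Int): RDD[(MergeKey, Array[Byte])] =
  records.repartitionAndSortWithinPartitions(new FileIdPartitioner(parts))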
>
>On Wed, Feb 26, 2020 at 10:11 PM lamberken  wrote:
>
>>
>>
>> hi, vinoth
>>
>>
>> > What do you mean by spark built in operators
>> We may not be able to depend on ExternalSpillableMap when upserting to a cow
>> table.
>>
>>
>> > Are you suggesting that we perform the merging in sql
>> No, just only use spark built-in operators like mapToPair, reduceByKey etc
>>
>>
>> Details has been described in this article[1], also finished draft
>> implementation and test.
>> mainly modified HoodieWriteClient#upsertRecordsInternal method.
>>
>>
>> [1]
>> https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
>> [2]
>> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
>>
>>
>>
>> At 2020-02-27 13:45:57, "Vinoth Chandar"  wrote:
>> >Hi lamber-ken,
>> >
>> >Thanks for this. I am not quite following the proposal. What do you mean
>> by
>> >spark built in operators? Dont we use the RDD based spark operations.
>> >
>> >Are you suggesting that we perform the merging in sql? Not following.
>> >Please clarify.
>> >
>> >On Wed, Feb 26, 2020 at 10:08 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi guys,
>> >>
>> >>
>> >> Motivation
>> >> Improve the merge performance for the cow table on upsert, handling the merge
>> >> operation by using spark built-in operators.
>> >>
>> >>
>> >> Background
>> >> When do a upsert operation, for each bucket, hudi needs to put new input
>> >> elements to memory cache map, and will
>> >> need an external map that spills content to disk when there is
>> >> insufficient space for it to grow.
>> >>
>> >>
>> >> There are several performance issues:
>> >> 1. We may need an external disk map and serialize / deserialize records
>> >> 2. Only a single thread does the I/O operation when checking
>> >> 3. Can't take advantage of built-in spark operators
>> >>
>> >>
>> >> Based on above, reworked the merge logic and done draft test.
>> >> If you are also interested in this, please go ahead with this doc[1],
>> any
>>

Re:Re: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-19 Thread lamberken


@Vinoth, glad to see your reply.


>> SchemaConverters does import things like types
I checked the git history of the package "org.apache.spark.sql.types"; it hasn't 
changed in a year, which means that spark does not change types often.


>> let's have a flag in maven to skip
Good suggestion: bundle it by default, like we bundle com.databricks:spark-avro_2.11.
But how to use maven-shade-plugin with such a flag needs more study.


Also, looking forward to others' thoughts.


Thanks,
Lamber-Ken





At 2020-02-20 03:50:12, "Vinoth Chandar"  wrote:
>Apologies for the delayed response..
>
>I think SchemaConverters does import things like types and those will be
>tied to the spark version. if there are new types for e.g, our bundled
>spark-avro may not recognize them for e.g..
>
>import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>import org.apache.spark.sql.types._
>import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>minBytesForPrecision}
>
>
>I also verified that we are bundling avro in the spark-bundle.. So, that
>part we are in the clear.
>
>Here is what I suggest.. let's try bundling in the hope that it works i.e
>spark does not change types etc often and spark-avro interplays.
>But let's have a flag in maven to skip this bundling if need be.. We should
>doc his clearly on the build instructions in the README?
>
>What do others think?
>
>
>
>On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth, sorry for the delay; I wanted to ensure the following analysis is correct.
>>
>>
>> In the hudi project, the spark-avro module is only used for converting between
>> spark's struct type and avro schema; only two methods are used,
>> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, and both
>> methods are in the `org.apache.spark.sql.avro.SchemaConverters` class.
>>
>>
>> Analyse:
>> 1, the `SchemaConverters` class are same in spark-master[1] and
>> branch-3.0[2].
>> 2, from the import statements in `SchemaConverters`, we can learn that
>> `SchemaConverters` doesn't depend on any other class in the spark-avro module.
>> Also, I tried to move it into the hudi project under a different package, and
>> the compilation went through.
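For reference, the two entry points look roughly like this in use (a small sketch, assuming the current public signatures of these methods):

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Struct type -> avro schema and back, which is all hudi needs from spark-avro.
val structType = StructType(Seq(StructField("name", StringType, nullable = false)))
val avroSchema: Schema = SchemaConverters.toAvroType(structType, nullable = false)
val sqlType = SchemaConverters.toSqlType(avroSchema).dataType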
>>
>>
>> Use the hudi jar with shaded spark-avro module:
>> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
>> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>>
>>
>> So, shading spark-avro is safe and gives a better user experience, and we
>> won't need to shade it once the spark-avro module is no longer external to
>> the spark project.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> [2]
>> https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>>
>>
>>
>>
>>
>>
>>
>> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>> >Just kicking this thread again, to make forward progress :)
>> >
>> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar  wrote:
>> >
>> >> First of all.. No apologies, no feeling bad.  We are all having fun
>> here..
>> >> :)
>> >>
>> >> I think we are all on the same page on the tradeoffs here.. let's see if
>> >> we can decide one way or other.
>> >>
>> >> Bundling spark-avro has better user experience, one less package to
>> >> remember adding. But even with the valid points raised by udit and
>> hmatu, I
>> >> was just worried about specific things in spark-avro that may not be
>> >> compatible with the spark version.. Can someone analyze how coupled
>> >> spark-avro is with rest of spark.. For e.g, what if the spark 3.x uses a
>> >> different avro version than spark 2.4.4 and when hudi-spark-bundle is
>> used
>> >> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> >> version?
>> >>
>> >> If someone can provide data points on the above and if we can convince
>> >> ourselves that we can bundle a different spark-avro version (even
>> >> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> >> position. Otherwise, if we might face a barrage of support issues with
>> >> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>> >>
>> >> TBH longer term, I am looking into if we can eliminate need for Row ->
>> >> Avro conversion that we need spark-avro for. But lets ignore that for
>> >> purposes of this discussion.
>> >>
>> >> Thanks
>> >> Vinoth
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu  wrote:
>> >>
>> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>> >>>
>> >>>
>> >>> It's right to recommend users to actually build their own hudi jars,
>> >>> with the spark version they use. It avoids the compatibility issues
>> >>>
>> >>> between user's local jars and pre-built hudi spark version(2.4.4).
>> >>>
>> >>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
>> >>> local env will contains that external depen

Re:Re: Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread lamberken


Thank you all. :)


Hi @nishith, good catch. I fixed it. 
https://github.com/apache/incubator-hudi/pull/1276/files?short_path=55fa8a8#diff-55fa8a81e6bf8c8d9d11d293b41511b5


Thanks
Lamber-Ken







At 2020-01-24 04:43:02, "nishith agarwal"  wrote:
>+1 looks great
>
>Nit : I see that the old diagram has "Raw Ingest Tables" vs the new one
>"Row Ingest Tables". IMO, "Raw Ingest Tables" sounds more logical.
>
>-Nishith
>
>On Thu, Jan 23, 2020 at 10:57 AM Vinoth Chandar  wrote:
>
>> +1. on that :)
>>
>> On Thu, Jan 23, 2020 at 10:22 AM hmatu <3480388...@qq.com> wrote:
>>
>> > The whole site looks better than old currently, big thanks for your work!
>> >
>> >
>> > Thanks,
>> > Hmatu
>> >
>> >
>> >
>> > -- Original --
>> > From: "Balaji Varadarajan"> > Date: Fri, Jan 24, 2020 01:21 AM
>> > To: "dev"> >
>> > Subject: Re: [DISCUSS] Redraw of hudi data lake architecture diagram
>> > on langing page
>> >
>> >
>> >
>> >  +1 as well. Looks great.
>> > Balaji.V
>> >     On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth
>> > Chandar wrote:
>> > Looks good. +1!
>> >
>> > On Wed, Jan 22, 2020 at 11:44 PM lamberken wrote:
>> > >
>> > >
>> > > Hello everyone,
>> > >
>> > >
>> > > I redrawed the hudi data lake architecture diagram on landing page.
>> > If you
>> > > have time, go ahead with hudi website[1] and test site[2].
>> > > Any thoughts are welcome, thanks very much. :)
>> > >
>> > >
>> > > [1] https://hudi.apache.org
>> > > [2] https://lamber-ken.github.io
>> > >
>> > >
>> > > Thanks
>> > > Lamber-Ken
>> >  
>>


Re:Re: Re: [DISCUSS] Rework of new web site

2019-12-16 Thread lamberken

Hi Vinoth,


1. I'll update the site content this week, clean up some useless template code, 
adjust the content, etc. Syncing the content will take a while.
2. I will adjust the style as much as I can to keep the theming blue and white.


When the above work is completed, I will notify you all again.
best,
lamber-ken


At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
>Hi Lamber,
>
>+1 on the look and feel. Definitely feels slick and fast. Love the syntax
>highlighting.
>
>
>Few things :
>- Can we just update the site content as-is? ( I'd rather change just the
>look-and-feel and evolve the content from there, per usual means)
>- Can we keep the theming blue and white, like now, since it gels well with
>the logo and images.
>
>
>On Mon, Dec 16, 2019 at 8:02 AM lamberken  wrote:
>
>>
>>
>> Thanks for your reply @lees @vino @vinoth :)
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>> At 2019-12-16 12:24:26, "leesf"  wrote:
>> >Hi Lamber,
>> >
>> >Thanks for your work, have gone through the new web ui, looks good.
>> >Hence +1 from my side.
>> >
>> >Best,
>> >Leesf
>> >
>> >On Mon, Dec 16, 2019 at 10:17 AM, vino yang wrote:
>> >
>> >> Hi Lamber,
>> >>
>> >> I am not an expert on Jekyll. But big +1 for your proposal to improve
>> the
>> >> site.
>> >>
>> >> Best,
>> >> Vino
>> >>
>> >> On Mon, Dec 16, 2019 at 3:15 AM, Vinoth Chandar wrote:
>> >>
>> >> > Thanks for taking the time to improve the site. Will review closely
>> and
>> >> get
>> >> > back to you.
>> >> >
>> >> > On Sun, Dec 15, 2019 at 11:02 AM lamberken  wrote:
>> >> >
>> >> > >
>> >> > >
>> >> > > Hello, everyone.
>> >> > >
>> >> > >
>> >> > > Compare to the web site of Delta Lake[1] and Apache Iceberg[2], they
>> >> may
>> >> > > looks better than hudi project[3].
>> >> > >
>> >> > >
>> >> > > I delved into our web ui and tried to improve it. I learned that the
>> >> > > hudi web ui is based on the jekyll-doc[4] theme, which is not active,
>> >> > > so we need to find a new, active theme.
>> >> > >
>> >> > >
>> >> > > So I tried my best to find a free and beautiful theme. Fortunately, I
>> >> > > found a suitable one among a huge number of themes (checking them one
>> >> > > by one): minimal-mistakes[5]. It's very popular and 100% free.
>> >> > >
>> >> > >
>> >> > > Based on the minimal theme, I reworked a basic new web ui framework,
>> >> > > adjusting some css styles, nav bars, etc.
>> >> > > If you are interested in this, please visit
>> >> https://lamber-ken.github.io
>> >> > > for a quick overview.
>> >> > >
>> >> > >
>> >> > > I’m looking forward to your reply, thanks!
>> >> > >
>> >> > >
>> >> > > [1] https://delta.io
>> >> > > [2] https://iceberg.apache.org
>> >> > > [3] http://hudi.apache.org
>> >> > > [4] https://github.com/tomjoht/documentation-theme-jekyll
>> >> > > [5] https://github.com/mmistakes/minimal-mistakes
>> >> > >
>> >> > >
>> >> > > best,
>> >> > > lamber-ken
>> >> > >
>> >> > >
>> >> >
>> >>
>>


Re:Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken


Hi, 




On 1 and 2: yes, you are right, the intent is to move the getter to the 
component-level Config class itself.


On 3: HoodieWriteConfig can also set values through ConfigOption; see the small 
code snippets below. From them, we can see that in the old version clients need 
to know each component's builders and also call their "with" methods to override 
the default values.


But in the new version, clients just need to know each component's public config 
options, just like constants.
So, these builders are redundant.
 
/---/


public class HoodieIndexConfigOptions {
  public static final ConfigOption<String> INDEX_TYPE = ConfigOption
      .key("hoodie.index.type")
      .defaultValue(HoodieIndex.IndexType.BLOOM.name());
}


public class HoodieWriteConfig {
  public void setString(ConfigOption<String> option, String value) {
    this.props.put(option.key(), value);
  }
}




/**
 * New version
 */
// set a value that overrides the default value
HoodieWriteConfig config = new HoodieWriteConfig();
config.setString(HoodieIndexConfigOptions.INDEX_TYPE, HoodieIndex.IndexType.HBASE.name());




/**
 * Old version
 */
HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder();
builder.withIndexConfig(HoodieIndexConfig.newBuilder()
    .withIndexType(HoodieIndex.IndexType.BLOOM).build());


/---/


Also, users use hudi like below; these are all plain string keys.
/---/


df.write.format("hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode(Overwrite).
save(basePath);


/---/




Last, as I responded to @vino, it's reasonable to handle fallback keys, but I 
think we need to do this step by step; it will be easy to integrate FallbackKey 
in the future, and it is not what we need right now in my opinion.


If some places are still not very clear, feel free to give feedback.




Best,
lamber-ken












At 2019-12-11 23:41:31, "Vinoth Chandar"  wrote:
>Hi Lamber-ken,
>
>I looked at the sample PR you put up as well.
>
>On 1,2 => Seems your intent is to replace these with moving the getter to
>the component level Config class itself? I am fine with that (although I
>think its not that big of a hurdle really to use atm). But, once we do that
>we could pass just the specific component config into parts of code versus
>passing in the entire HoodieWriteConfig object. I am fine with moving the
>classes to a ConfigOption class as you suggested as well.
>
>On 3, I still we feel we will need the builder pattern going forward. to
>build the HoodieWriteConfig object. Like below? Cannot understand why we
>would want to change this. Could you please clarify?
>
>HoodieWriteConfig.Builder builder =
>
> HoodieWriteConfig.newBuilder().withPath(cfg.targetBasePath).combineInput(cfg.filterDupes,
>true)
>
> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName)
>// Inline compaction is disabled for continuous mode.
>otherwise enabled for MOR
>.withInlineCompaction(cfg.isInlineCompactionEnabled()).build())
>.forTable(cfg.targetTableName)
>
> .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>.withAutoCommit(false).withProps(props);
>
>
>Typically, we write RFCs for large changes that breaks existing behavior or
>introduces significantly complex new features.. If you are just planning to
>do the refactoring into ConfigOption class, per se you don't need a RFC.
>But , if you plan to address the fallback keys (or) your changes are going
>to break/change existing jobs, we would need a RFC.
>
>>> It is not clear to me whether there is any external facing changes which
>changes this model.
>I am still unclear on this as well. can you please explicitly clarify?
>
>thanks
>vinoth
>
>
>On Tue, Dec 10, 2019 at 12:35 PM lamberken  wrote:
>
>>
>> Hi, @Balaji @Vinoth
>>
>>
>> I'm sorry, some places are not very clear,
>>
>>
>> 1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc.. already
>> defined in project.
>>But we get property value by methods which defined in
>> HoodieWriteConfig, like HoodieWriteConfig#getParquetMaxFileSize,
>>HoodieWriteConfig#getParquetBlockSize, etc. It means that the
>> Hoodie*Config classes are redundant.
>>
>>
>> 2, These Hoodie*Config classes are used to set default value when call
>> their build method, nothing e

Re:Re: Re:[DISCUSS] Scaling community support

2019-12-08 Thread lamberken


Okay, thanks for reminding me. I'll check the earlier discuss thread.


At 2019-12-09 14:09:56, "Vinoth Chandar"  wrote:

Please see an earlier discuss thread on the same topic - GH issues. 


Let's please keep this thread to discussing the support process, not logistics, if I 
may say so :)


On Sun, Dec 8, 2019 at 10:03 PM lamberken  wrote:



In addition, we can use some tags to mark these issues, like "question", "bug", and 
"new feature". We can solve the bugs first.




Best,
lamber-ken








At 2019-12-09 13:43:38, "lamberken"  wrote:
>
>
>Hi, I'd like to make suggestions from the perspective of contributor, just for 
>reference only.
>
>
>About [1]
>As the hudi project grows, users / developers will encounter various problems and 
>will ask questions on this mailing list, GH issues, or occasionally slack. I 
>think committers should first guide them to create a related jira for their 
>problems, because committers or PMC members may be focusing on their own work 
>(fixing a bug / developing new features) and don't have enough time to answer 
>these occasional issues. We can see that Spark, Flink, Hadoop and other popular 
>projects have turned issues off on github. Users cannot
>create issues on GH; they can create a jira or send an email, so committers / 
>PMC members can resolve these issues in order. 
>
>
>https://github.com/apache/spark
>https://github.com/apache/flink
>https://github.com/apache/calcite
>https://github.com/apache/hadoop
>
>
>Best,
>lamber-ken
>
>
>
>At 2019-12-08 04:01:13, "Vinoth Chandar"  wrote:
>>Hello all,
>>
>>As we grow, we need a scalable way for new users/contributors to either
>>easily use Hudi or ramp up on the project. Last month alone, we had close
>>to 1600 notifications on commits@. and few hundred emails on this list. In
>>addition, to authoring RFCs and implementing JIRAs we need to share the
>>following responsibilities amongst us to be able to scale this process.
>>
>>1) Answering issues on this mailing list or GH issues or occasionally
>>slack. We need a clear owner to triage the problem, reproduce it if needed,
>>either provide suggestions or file a JIRA - AND always look for ways to
>>update the FAQ. We need a clear hand off process also.
>>2) Code review process currently spreads the load amongst all the
>>committers. But PRs vary dramatically in their complexity and we need more
>>committers who can review any part of the codebase.
>>3) Responding to pings/clarifications and unblocking . IMHO committers
>>should prioritize this higher than working on their own stuff (I know I
>>have been doing this at some cost to my productivity on the project). This
>>is the only way to scale and add new committers. committers need to be
>>nurturing in this process.
>>
>>I don't have a clear proposals for scaling 2 & 3, which fall heavily on
>>committers.. Love to hear suggestions.
>>
>>But for 1, I propose we have 2-3 day "Support Rotations" where any
>>contributor can assume responsibility for support the community. This
>>brings more focus to support and also fast tracks learning/ramping for the
>>person on the rotation. It also minimizes interruptions for other folks and
>>we gain more velocity. I am sure this is familiar to a lot of you at your
>>own companies. We have at-least 10-15 active contributors at this point..
>>So  the investment is minimal : doing this once a month.
>>
>> A committer and a PMC member will always be designated secondary/backup in
>>case the primary cannot field a question. I am happy to additionally
>>volunteer as "always on rotation" as a third level backup, to get this
>>process booted up.
>>
>>Please let me know what you all think. Please be specific in what issue
>>[1][2] or [3] you are talking about in your feedback
>>
>>thanks
>>vinoth


Re:Re: Re: [DISCUSS] Refactor scala checkstyle

2019-12-06 Thread lamberken


OK, thank you for your reply. I will start to work on this.


At 2019-12-06 22:31:22, "Vinoth Chandar"  wrote:
>+1 from me as well.
>
>On Fri, Dec 6, 2019 at 6:25 AM leesf  wrote:
>
>> +1 to refractor the scala checkstyle.
>>
>> Best,
>> Leesf
>>
>> On Fri, Dec 6, 2019 at 8:00 PM, lamberken wrote:
>>
>> > Right, refactor step by step like java style.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > At 2019-12-06 16:35:04, "vino yang"  wrote:
>> > >Hi lamber,
>> > >
>> > >+1 from my side.
>> > >
>> > >IMO, it would be better to refactor step by step like java style.
>> Firstly,
>> > >we should refactor code based on warning message, then change the
>> > >checkstyle rule level.
>> > >
>> > >WDYT? Is it what you prepare to do?
>> > >
>> > >Best,
>> > >Vino
>> > >
>> > >
>> > >On Fri, Dec 6, 2019 at 2:39 PM, lamberken wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >>
>> > >> Currently, the level of the scala codestyle rules is warning; it's better
>> > >> to check these rules one by one
>> > >> and refactor the scala code now.
>> > >>
>> > >>
>> > >> Furthermore, in order to sync with the java codestyle, two rules need to
>> > >> be added. One is BlockImportChecker,
>> > >> which ensures that only single imports are used in order to
>> > >> minimize merge errors in import declarations; the other is ImportOrderChecker,
>> > >> which checks that imports are grouped and ordered according to the style
>> > >> configuration.
>> > >>
>> > >>
>> > >> Summary
>> > >> 1, check scala checkstyle rules one by one, change some warning level
>> to
>> > >> error.
>> > >> 2, add ImportOrderChecker and BlockImportChecker.
>> > >>
>> > >>
>> > >> Any comments and feedback are welcome, WDYT?
>> > >>
>> > >>
>> > >> Best,
>> > >> lamber-ken
>> >
>>