Re: Issue while querying Hive table after updates

2019-11-18 Thread Gurudatt Kulkarni
Hi Bhavani Sudha, >> Are you using spark sql or Hive query? This happens on all of them: Hive, Hive on Spark, and Spark SQL. >> the table type, This happens for both copy-on-write and merge-on-read. >> configs, hoodie.upsert.shuffle.parallelism=2 hoodie.insert.shuffle.parallelism=2 hoodie.bulkinsert
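
For reference, the two complete settings quoted in this message can be collected into an options dictionary. This is only a sketch: `parse_hoodie_configs` is a hypothetical helper, not part of Hudi, and the third key is cut off in the snippet, so it is deliberately left out.

```python
def parse_hoodie_configs(text):
    """Turn whitespace-separated key=value tokens into a dict.

    Tokens without both a key and a value (for example a key that was
    cut off before its '=') are skipped rather than guessed at.
    """
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key and value:
            options[key] = value
    return options


# Only the two settings that appear in full in the message.
quoted = (
    "hoodie.upsert.shuffle.parallelism=2 "
    "hoodie.insert.shuffle.parallelism=2"
)
options = parse_hoodie_configs(quoted)
```

A dictionary like this can be handed to PySpark's `DataFrameWriter.options(**options)` when writing a table.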

Re: [DISCUSS] Hide Github issues tab and Unified management of issues in JIRA

2019-11-18 Thread Gurudatt Kulkarni
> With templates, we can collect good information while people file the > issues. Not sure about permissions we have on JIRA to enable bots, but may > have more luck on github workflows doing these already? > Can we do templates/required fields with JIRAs as well? Yes, it is very much possible to

Re: [DISCUSS] Hide Github issues tab and Unified management of issues in JIRA

2019-11-18 Thread vino yang
Hi Gurudatt and Vinoth, Thanks for sharing your valuable opinions. Considering Hudi is still a growing project, I agree that it's better to keep GitHub's Issues tab as a way to discuss problems for now. +1 to introducing an issue template and a management bot. Best, Vino Vinoth Chandar, on 2019-11-19

Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread Vinoth Chandar
+1 on all three. Would there be an overhaul of existing code to add comments to all classes? We are pretty reasonable already, but good to get this in shape. 17:54:37 [incubator-hudi]$ grep -R -B 1 "public class" hudi-*/src/main/java | grep "public class" | wc -l 274 17:54:50 [incubator-hudi]
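
The count in this message comes from a `grep` pipeline that tallies lines containing `public class`. A rough Python equivalent is sketched below. It is an approximation under stated assumptions: `count_public_class_lines` is a hypothetical helper, it only looks at `*.java` files (the `grep -R` in the message scans every file), and the caller must pass each source root explicitly.

```python
from pathlib import Path


def count_public_class_lines(root, pattern="public class"):
    """Count lines containing `pattern` in all .java files under `root`."""
    total = 0
    for java_file in Path(root).rglob("*.java"):
        # errors="replace" keeps the scan going on oddly encoded files.
        with open(java_file, encoding="utf-8", errors="replace") as handle:
            total += sum(1 for line in handle if pattern in line)
    return total
```

Summing the result over each module's `src/main/java` directory should give a number comparable to the one quoted, though not necessarily identical.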

Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread lamberken
+1, it's hard work but meaningful. | | lamberken IT | | ly.com lamber...@163.com | Signature customized by NetEase Mail Master On 11/19/2019 07:27, leesf wrote: Hi vino, Thanks for bringing this discussion up. +1 on all. The third one seems a bit too strict and usually requires manual processing of the import order, but I al

Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread leesf
Hi vino, Thanks for bringing this discussion up. +1 on all. The third one seems a bit too strict and usually requires manual processing of the import order, but I also agree and think it makes our project more professional. And I learned that the Calcite community is also applying this rule. Best,

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Bhavani Sudha
Hi Pratyaksh, Let me try to answer this. I believe Spark does not natively invoke HoodieParquetInputFormat.getSplits() like Hive and Presto do. So when queried, Spark just loads all the data files in that partition without applying Hoodie filtering logic. That's why we need to instruct Spark to r

Re: Issue while querying Hive table after updates

2019-11-18 Thread Bhavani Sudha
Hi Gurudatt, Can you share more context on the table and the query? Are you using Spark SQL or a Hive query? What is the table type, etc.? Also, if you can provide a small snippet to reproduce with the configs that you used, it would be useful to debug. Thanks, Sudha On Sun, Nov 17, 2019 at 11:09 PM Gurud

Re: [DISCUSS] Hide Github issues tab and Unified management of issues in JIRA

2019-11-18 Thread Vinoth Chandar
If we decide to keep GitHub Issues, both are great suggestions. We should still debate whether we keep GH issues. I just shared my opinion. :) With templates, we can collect good information while people file the issues. Not sure about permissions we have on JIRA to enable bots, but may have more luck on g

Re: Reporting 0.5.0-incubating release to reporter.apache.org

2019-11-18 Thread Vinoth Chandar
https://jira.apache.org/jira/browse/HUDI-343 tracks this. On Sat, Nov 16, 2019 at 1:46 PM Thomas Weise wrote: > Sorry for the late reply. > > The reporter is applicable to top level projects. > > But please create a DOAP file for Hudi, where you can also list the > release: https://projects.apac

Re: [DISCUSS] Hide Github issues tab and Unified management of issues in JIRA

2019-11-18 Thread Gurudatt Kulkarni
Hi Vinoth / Vino, Just adding my 2 cents to the discussion. Yes, I agree that GitHub issues are low friction and can be the first line of support. It will help in keeping the JIRA clean. Potential solutions that I have come across in the community: 1. Introduce an issue template. 2. Add a bot th

Re: [DISCUSS] Hide Github issues tab and Unified management of issues in JIRA

2019-11-18 Thread Vinoth Chandar
@vinoyang. All valid points. I have just one argument (on all the others you are right, and I have always known this tradeoff) for keeping GitHub issues while we are still growing the community: it lets anyone with a GitHub ID raise an issue without being forced to sign up for a JIRA account. For large

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Purushotham Pushpavanthar
Figured it out. The command below worked for me in PySpark: spark._jsc.hadoopConfiguration().set('mapreduce.input.pathFilter.class','org.apache.hudi.hadoop.HoodieROTablePathFilter') Regards, Purushotham Pushpavanth On Mon, 18 Nov 2019 at 16:47, Purushotham Pushpavanthar < pushpavant...@gmail.com> w
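
The one-liner in this message can be wrapped in a small function so it is applied once per session. This is a minimal sketch: the configuration key and class name are copied from the message, while `register_hudi_path_filter` and the two constants are hypothetical names, and `spark` is assumed to be an already created `SparkSession`.

```python
# Key and class name are taken verbatim from the message above.
PATH_FILTER_KEY = "mapreduce.input.pathFilter.class"
HUDI_RO_PATH_FILTER = "org.apache.hudi.hadoop.HoodieROTablePathFilter"


def register_hudi_path_filter(spark):
    """Set the Hadoop path filter on the session's Hadoop configuration.

    Returns the value read back from the configuration so the caller
    can confirm that the setting took effect.
    """
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set(PATH_FILTER_KEY, HUDI_RO_PATH_FILTER)
    return hadoop_conf.get(PATH_FILTER_KEY)
```

Calling `register_hudi_path_filter(spark)` before reading the table mirrors what the poster reports doing. Note that `_jsc` is an internal attribute of the PySpark session, so this approach may change between Spark versions.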

Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread Pratyaksh Sharma
Having proper class-level and method-level comments always makes life easier for any new user. +1 for points 1, 2 and 4. On Mon, Nov 18, 2019 at 5:59 PM vino yang wrote: > Hi guys, > > Currently, Hudi's comment and code styles do not have a uniform > specification on certain rules. I will li

[DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread vino yang
Hi guys, Currently, Hudi's comment and code styles do not have a uniform specification on certain rules. I will list them below. With the rapid development of the community, the inconsistent comment specification will bring a lot of problems. I assume here that everyone is aware of its impor

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Purushotham Pushpavanthar
Kabeer, can you please share the PySpark command to register the path filter class? Regards, Purushotham Pushpavanth On Mon, 18 Nov 2019 at 13:46, Pratyaksh Sharma wrote: > Hi Vinoth/Kabeer, > > I have one small doubt regarding what you proposed to fix the issue. Why is > HoodieParquetInputFormat c

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Pratyaksh Sharma
Hi Vinoth/Kabeer, I have one small doubt regarding what you proposed to fix the issue. Why is the HoodieParquetInputFormat class not able to handle deduplication of records in the case of Spark, while it is able to do so in the case of Presto and Hive? On Sun, Nov 17, 2019 at 4:08 AM Vinoth Chandar wrote: >