[jira] [Created] (CARBONDATA-294) Timestamp datatype Error
Lionx created CARBONDATA-294: Summary: Timestamp datatype Error Key: CARBONDATA-294 URL: https://issues.apache.org/jira/browse/CARBONDATA-294 Project: CarbonData Issue Type: Bug Reporter: Lionx Assignee: Lionx Priority: Critical In CarbonExample, when 2015/7/23 is loaded as a Timestamp, a query returns 2015-01-23 xx:xx:xx:xx. Six months have been stolen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Discussion regarding design of data load after kettle removal.
Hi Ravindra, It seems the picture is missing, can you post it in a URL and share the link? Regards, Jacky -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-regrading-design-of-data-load-after-kettle-removal-tp1672p1725.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
[GitHub] incubator-carbondata pull request #225: Abstract Snappy interface and sepera...
GitHub user Zhangshunyu opened a pull request: https://github.com/apache/incubator-carbondata/pull/225 Abstract Snappy interface and separate it from Compressor interface ## Why raise this pr? Currently the snappy compressor is the only one that extends the Compressor interface. For future expansion, we need to abstract a Snappy interface and separate it from the Compressor interface. This means `the Compressor interface is the parent of all compressors; SnappyCompressor and the other compressors' interfaces should extend the Compressor interface; and where a compressor handles different data types, each data type extends that compressor's own interface.` ## How to test? Pass all the test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Zhangshunyu/incubator-carbondata compress_interface Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/225.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #225 commit 98536737d786c40192d197a3af3e52254949d4fd Author: Zhangshunyu Date: 2016-10-10T09:17:31Z Abstract Snappy interface and seperate it from Compressor interface --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Re: Discussion regarding design of data load after kettle removal.
Hi Ravindra, I have the following questions: 1. How does the DataLoadProcessorStep interface work? Does each step call its child step to execute, then apply its own logic to the iterator the child returns? And how does it map to OutputFormat in the Hadoop interface? 2. This step interface relies on an iterator to do the encoding row by row. Will it be convenient to add batch encoder support, now or later? 3. For the dictionary part, besides the generator, I think it is better to also consider the interface for reading the dictionary while querying. Are you planning to use the same interface? If so, it is not just a Generator. If the dictionary interface is well designed, other developers can also add new dictionary types. For example: - a dictionary that assigns values based on usage frequency, for better compression, similar to Huffman encoding - an order-preserving dictionary, which allows range filters directly on dictionary values Regards, Jacky
[jira] [Created] (CARBONDATA-295) Abstract Snappy interface and separate it from Compressor interface
zhangshunyu created CARBONDATA-295: Summary: Abstract Snappy interface and separate it from Compressor interface Key: CARBONDATA-295 URL: https://issues.apache.org/jira/browse/CARBONDATA-295 Project: CarbonData Issue Type: Improvement Components: data-load Affects Versions: 0.1.1-incubating Reporter: zhangshunyu Assignee: zhangshunyu Priority: Minor Fix For: 0.2.0-incubating Currently the snappy compressor is the only one that extends the Compressor interface. For future expansion, we need to abstract a Snappy interface and separate it from the Compressor interface. This means the Compressor interface is the parent of all compressors; the SnappyCompressor interface and the other compressors' interfaces (or abstract classes) should extend the Compressor interface; and where a compressor handles different data types, each data type extends that compressor's own interface or abstract class. For example: Compressor -> SnappyCompressor -> SnappyDoubleCompression.
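The hierarchy named in the example (Compressor -> SnappyCompressor -> SnappyDoubleCompression) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the code in the pull request: all method names are hypothetical, and `java.util.zip` stands in for a real Snappy binding so that the sketch runs on a plain JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Parent of all compressors: only the byte-level contract lives here (hypothetical methods). */
interface Compressor {
  String name();
  byte[] compress(byte[] raw);
  byte[] uncompress(byte[] packed);
}

/** Codec layer. ASSUMPTION: java.util.zip is a stand-in for a real Snappy library. */
abstract class SnappyCompressor implements Compressor {
  @Override public String name() { return "snappy (zip stand-in)"; }

  @Override public byte[] compress(byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[256];
    while (!deflater.finished()) {
      out.write(buffer, 0, deflater.deflate(buffer));
    }
    deflater.end();
    return out.toByteArray();
  }

  @Override public byte[] uncompress(byte[] packed) {
    Inflater inflater = new Inflater();
    inflater.setInput(packed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[256];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buffer);
        // No progress and nothing left to read means the input was truncated.
        if (n == 0 && !inflater.finished() && inflater.needsInput()) {
          throw new IllegalArgumentException("truncated input");
        }
        out.write(buffer, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalArgumentException("corrupt input", e);
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }
}

/** Data-type layer: adds double[] handling on top of the codec layer. */
final class SnappyDoubleCompression extends SnappyCompressor {
  byte[] compressDoubles(double[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES);
    for (double v : values) {
      buf.putDouble(v);
    }
    return compress(buf.array());
  }

  double[] uncompressDoubles(byte[] packed) {
    ByteBuffer buf = ByteBuffer.wrap(uncompress(packed));
    double[] values = new double[buf.remaining() / Double.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getDouble();
    }
    return values;
  }

  public static void main(String[] args) {
    SnappyDoubleCompression c = new SnappyDoubleCompression();
    double[] restored = c.uncompressDoubles(c.compressDoubles(new double[] {1.5, -2.25}));
    System.out.println(c.name() + " -> " + Arrays.toString(restored));
  }
}
```

With this layering, a second codec would add another subtype of `Compressor` without touching the Snappy classes, which appears to be the expansion path the issue has in mind.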
[GitHub] incubator-carbondata pull request #226: [CARBONDATA-294]Fix timestamp data e...
GitHub user lion-x opened a pull request: https://github.com/apache/incubator-carbondata/pull/226 [CARBONDATA-294]Fix timestamp data error # Why raise this PR? In some examples and test cases, **CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT** is assigned a wrong timestamp format, "yyyy/mm/dd". In this pattern the lowercase `mm` means minutes, so the month is never parsed and falls back to its default value of 1. For example, 2015/07/23 is read as 2015/01/23 00:07:xx.xxx. The right timestamp format is "yyyy/MM/dd". This PR fixes the wrong usages in some example files and test case files. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lion-x/incubator-carbondata timeError Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/226.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #226 commit 4ae904f881766ab0990132e8be6fc6d7cfaf72a8 Author: lion-x Date: 2016-10-10T11:17:35Z Fixtimestamperror
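The effect described in this pull request can be reproduced with a plain `java.text.SimpleDateFormat`, outside CarbonData. The sketch below assumes the two patterns were `yyyy/mm/dd` and `yyyy/MM/dd` (the year letters are an assumption, since the archived text shows only `/mm/dd` and `/MM/dd`); the class and method names are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

final class TimestampPatternDemo {
  /** Parses value with pattern (in UTC) and returns {month 1-12, minute}. */
  static int[] monthAndMinute(String pattern, String value) {
    TimeZone utc = TimeZone.getTimeZone("UTC");
    SimpleDateFormat parser = new SimpleDateFormat(pattern, Locale.ROOT);
    parser.setTimeZone(utc);
    Calendar calendar = Calendar.getInstance(utc, Locale.ROOT);
    try {
      calendar.setTime(parser.parse(value));
    } catch (ParseException e) {
      throw new IllegalArgumentException("cannot parse " + value, e);
    }
    // Calendar.MONTH is zero-based, so add 1 for a human-readable month.
    return new int[] {calendar.get(Calendar.MONTH) + 1, calendar.get(Calendar.MINUTE)};
  }

  public static void main(String[] args) {
    // Lowercase "mm" is minute-of-hour: "07" lands in the minutes and the month stays at January.
    int[] wrong = monthAndMinute("yyyy/mm/dd", "2015/07/23");
    // Uppercase "MM" is month-of-year: "07" is parsed as July.
    int[] right = monthAndMinute("yyyy/MM/dd", "2015/07/23");
    System.out.println("yyyy/mm/dd -> month " + wrong[0] + ", minute " + wrong[1]);
    System.out.println("yyyy/MM/dd -> month " + right[0] + ", minute " + right[1]);
  }
}
```

The parse does not fail with the wrong pattern, which is presumably why the problem only showed up when the loaded data was queried.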
Re: Discussion regarding design of data load after kettle removal.
Hi Jacky, https://drive.google.com/open?id=0B4TWTVbFSTnqeElyWko5NDlBZkdxS3NrMW1PZndzMG5ZM2Y0 1. Yes, each step calls its child step to execute and applies its own logic to the returned iterator, just like Spark SQL. For CarbonOutputFormat, it will use RecordBufferedWriterIterator, which collects the data in batches. https://drive.google.com/open?id=0B4TWTVbFSTnqTF85anlDOUQ5S1BqYzFpLWcwZnBLSVVqSWpj 2. Yes, this interface relies on processing row by row, but we can also execute in batches inside the iterator. 3. Yes, the dictionary interface is also used for reading the dictionary while querying. Based on my understanding I have added this interface; we can discuss it further and update the interface. Regards, Ravi
-- Thanks & Regards, Ravi
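The step chaining discussed in this thread (each step pulls from its child's iterator and applies its own logic, with batches collected on the output side) might look like the sketch below. Every class and method name here is hypothetical; the real DataLoadProcessorStep and RecordBufferedWriterIterator interfaces are in the linked design documents, not in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical step: wraps the child's iterator and transforms rows lazily. */
abstract class LoadStep {
  protected final LoadStep child;

  LoadStep(LoadStep child) { this.child = child; }

  /** The step's own logic, applied to one row at a time. */
  protected abstract Object[] processRow(Object[] row);

  Iterator<Object[]> execute() {
    final Iterator<Object[]> input = child.execute();
    return new Iterator<Object[]>() {
      @Override public boolean hasNext() { return input.hasNext(); }
      @Override public Object[] next() { return processRow(input.next()); }
    };
  }
}

/** Leaf step: has no child and simply exposes the source rows. */
final class InputStep extends LoadStep {
  private final List<Object[]> rows;

  InputStep(List<Object[]> rows) { super(null); this.rows = rows; }

  @Override protected Object[] processRow(Object[] row) { return row; }
  @Override Iterator<Object[]> execute() { return rows.iterator(); }
}

/** Example transformation standing in for parsing or encoding. */
final class UpperCaseStep extends LoadStep {
  UpperCaseStep(LoadStep child) { super(child); }

  @Override protected Object[] processRow(Object[] row) {
    Object[] out = new Object[row.length];
    for (int i = 0; i < row.length; i++) {
      out[i] = row[i].toString().toUpperCase(Locale.ROOT);
    }
    return out;
  }
}

final class StepBatches {
  /** Drains a row iterator into batches of at most batchSize rows. */
  static List<List<Object[]>> inBatches(Iterator<Object[]> rows, int batchSize) {
    List<List<Object[]>> batches = new ArrayList<>();
    List<Object[]> current = new ArrayList<>();
    while (rows.hasNext()) {
      current.add(rows.next());
      if (current.size() == batchSize) {
        batches.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Object[]> source = new ArrayList<>();
    source.add(new Object[] {"a", "b"});
    source.add(new Object[] {"c", "d"});
    source.add(new Object[] {"e", "f"});
    Iterator<Object[]> rows = new UpperCaseStep(new InputStep(source)).execute();
    for (List<Object[]> batch : inBatches(rows, 2)) {
      System.out.println("batch of " + batch.size() + ", first row " + Arrays.toString(batch.get(0)));
    }
  }
}
```

Because each step only wraps an iterator, nothing is processed until the final consumer pulls rows, and batching can be layered on top without changing the row-by-row contract.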
RE: Discussion about using multi local directories to improve dataloading performance
Agree, this will help boost performance. Jenny -----Original Message----- From: Jacky Li [mailto:jacky.li...@qq.com] Sent: Saturday, October 08, 2016 9:09 AM To: dev@carbondata.incubator.apache.org Subject: Re: Discussion about using multi local directories to improve dataloading performance Yes, I think it is a good feature to have. Please feel free to create a JIRA issue and a pull request. Regards, Jacky > On October 9, 2016, at 12:04 AM, caiqiang wrote: > > Hi All, > For each data load, we write the sorted temp files into only one local directory. I think this is a bottleneck of data loading. It is necessary to use multiple local directories on multiple disks for each data load to improve data loading performance.
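One simple way to spread sort temp files over several local directories is round-robin selection. The thread does not say which policy the feature would use, so the sketch below is purely an assumption, and `TempDirChooser` and its comma-separated configuration string are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out temp directories in round-robin order. */
final class TempDirChooser {
  private final List<String> dirs = new ArrayList<>();
  private final AtomicInteger counter = new AtomicInteger();

  /** Accepts a comma-separated list such as "/disk1/tmp,/disk2/tmp". */
  TempDirChooser(String commaSeparatedDirs) {
    for (String dir : commaSeparatedDirs.split(",")) {
      String trimmed = dir.trim();
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    if (dirs.isEmpty()) {
      throw new IllegalArgumentException("at least one directory is required");
    }
  }

  /** Thread-safe; floorMod keeps the index valid even after the counter overflows. */
  String nextDir() {
    return dirs.get(Math.floorMod(counter.getAndIncrement(), dirs.size()));
  }

  public static void main(String[] args) {
    TempDirChooser chooser = new TempDirChooser("/disk1/tmp, /disk2/tmp");
    for (int i = 0; i < 4; i++) {
      System.out.println("sort temp file " + i + " -> " + chooser.nextDir());
    }
  }
}
```

Round-robin ignores free space and disk speed; a real implementation might need to weigh those as well.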
[GitHub] incubator-carbondata pull request #225: [CARBONDATA-295]Abstract Compressor ...
Github user Zhangshunyu closed the pull request at: https://github.com/apache/incubator-carbondata/pull/225
[jira] [Created] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.
Ravindra Pesala created CARBONDATA-296: Summary: 1.Add CSVInputFormat to read csv files. Key: CARBONDATA-296 URL: https://issues.apache.org/jira/browse/CARBONDATA-296 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add CSVInputFormat to read CSV files. It should use the Univocity parser to read the CSV files for optimal performance.
[jira] [Created] (CARBONDATA-297) 2. Add interfaces for data loading.
Ravindra Pesala created CARBONDATA-297: Summary: 2. Add interfaces for data loading. Key: CARBONDATA-297 URL: https://issues.apache.org/jira/browse/CARBONDATA-297 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add the major interface classes for data loading so that the following JIRAs can implement against these interfaces.
[jira] [Created] (CARBONDATA-298) 3. Add InputProcessorStep which should iterate recordreader and parse the data as per the data type.
Ravindra Pesala created CARBONDATA-298: Summary: 3. Add InputProcessorStep which should iterate recordreader and parse the data as per the data type. Key: CARBONDATA-298 URL: https://issues.apache.org/jira/browse/CARBONDATA-298 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add InputProcessorStep which should iterate recordreader/RecordBufferedWriter and parse the data as per the data types.
[jira] [Created] (CARBONDATA-299) 4. Add dictionary generator interfaces and give implementation for pre created dictionary.
Ravindra Pesala created CARBONDATA-299: Summary: 4. Add dictionary generator interfaces and give implementation for pre created dictionary. Key: CARBONDATA-299 URL: https://issues.apache.org/jira/browse/CARBONDATA-299 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add dictionary generator interfaces and provide an implementation for a pre-created dictionary (one that is generated separately).
[jira] [Created] (CARBONDATA-300) 5. Add EncodeProcessorStep which encodes the data with dictionary.
Ravindra Pesala created CARBONDATA-300: Summary: 5. Add EncodeProcessorStep which encodes the data with dictionary. Key: CARBONDATA-300 URL: https://issues.apache.org/jira/browse/CARBONDATA-300 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add EncodeProcessorStep, which encodes the data with a dictionary. This dictionary can be obtained from the dictionary interface.
[jira] [Created] (CARBONDATA-301) 6. Add SortProcessorStep which sorts the data as per dimension order and write the sorted files to temp location.
Ravindra Pesala created CARBONDATA-301: Summary: 6. Add SortProcessorStep which sorts the data as per dimension order and write the sorted files to temp location. Key: CARBONDATA-301 URL: https://issues.apache.org/jira/browse/CARBONDATA-301 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add SortProcessorStep, which sorts the data as per dimension order and writes the sorted files to a temp location.
[jira] [Created] (CARBONDATA-302) 7. Add DataWriterProcessorStep which reads the data from sort temp files and creates carbondata files.
Ravindra Pesala created CARBONDATA-302: Summary: 7. Add DataWriterProcessorStep which reads the data from sort temp files and creates carbondata files. Key: CARBONDATA-302 URL: https://issues.apache.org/jira/browse/CARBONDATA-302 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add DataWriterProcessorStep, which reads the data from the sort temp files, merge-sorts it, applies the mdk generator to the key, and creates carbondata files.
[jira] [Created] (CARBONDATA-303) 8. Add CarbonTableOutpuFormat to write data to carbon.
Ravindra Pesala created CARBONDATA-303: Summary: 8. Add CarbonTableOutpuFormat to write data to carbon. Key: CARBONDATA-303 URL: https://issues.apache.org/jira/browse/CARBONDATA-303 Project: CarbonData Issue Type: Sub-task Reporter: Ravindra Pesala Add CarbonTableOutpuFormat to write data to carbon. It should use the DataProcessorStep interface to load the data.
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user Zhangshunyu commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82720156 --- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java --- @@ -1171,6 +1171,14 @@ else if(isComplexTypeColumn[j]) { DirectDictionaryGenerator directDictionaryGenerator1 = DirectDictionaryKeyGeneratorFactory .getDirectDictionaryGenerator(details.getColumnType()); + String[] timeformats = meta.timeFormat.split(","); + for(String timeformat:timeformats){ --- End diff -- Style: a space is needed, 'for(' => 'for (' and '){' => ') {', here and in some other places.
[GitHub] incubator-carbondata pull request #215: [WIP][CARBONDATA-2] Remove kettle fr...
Github user ravipesala closed the pull request at: https://github.com/apache/incubator-carbondata/pull/215
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user QiangCai commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82719262 --- Diff: core/src/main/java/org/apache/carbondata/core/keygenerator/directdictionary/timestamp/TimeStampDirectDictionaryGenerator.java --- @@ -117,15 +117,24 @@ private TimeStampDirectDictionaryGenerator() { * @return dictionary value */ @Override public int generateDirectSurrogateKey(String memberStr) { -SimpleDateFormat timeParser = new SimpleDateFormat(CarbonProperties.getInstance() -.getProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, -CarbonCommonConstants.CARBON_TIMESTAMP_DEFAULT_FORMAT)); +String timeString; +String formatString; +if (memberStr.contains(CarbonCommonConstants.COLON_SPC_CHARACTER)){ --- End diff -- Why would the data contain COLON_SPC_CHARACTER?
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user QiangCai commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82720651 --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java --- @@ -356,6 +356,7 @@ public static void executeGraph(LoadModel loadModel, String storeLocation, Strin schmaModel.setEscapeCharacter("\\"); schmaModel.setQuoteCharacter("\""); schmaModel.setCommentCharacter("#"); + schmaModel.setTimeFormat(CarbonCommonConstants.CARBON_TIMESTAMP_DEFAULT_FORMAT); --- End diff -- No need to modify this file
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user QiangCai commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82720457 --- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java --- @@ -1171,6 +1171,14 @@ else if(isComplexTypeColumn[j]) { DirectDictionaryGenerator directDictionaryGenerator1 = DirectDictionaryKeyGeneratorFactory .getDirectDictionaryGenerator(details.getColumnType()); --- End diff -- If the column type is TimeStamp, please pass the date format to the KeyGenerator. It would be better to provide a different key generator for each date format.
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user QiangCai commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82719466 --- Diff: core/src/main/java/org/apache/carbondata/core/keygenerator/directdictionary/timestamp/TimeStampDirectDictionaryGenerator.java --- @@ -117,15 +117,24 @@ private TimeStampDirectDictionaryGenerator() { * @return dictionary value */ @Override public int generateDirectSurrogateKey(String memberStr) { -SimpleDateFormat timeParser = new SimpleDateFormat(CarbonProperties.getInstance() -.getProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, -CarbonCommonConstants.CARBON_TIMESTAMP_DEFAULT_FORMAT)); +String timeString; --- End diff -- Please use the word "date" instead of "time".
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user QiangCai commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82720605 --- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java --- @@ -1171,6 +1171,14 @@ else if(isComplexTypeColumn[j]) { DirectDictionaryGenerator directDictionaryGenerator1 = DirectDictionaryKeyGeneratorFactory .getDirectDictionaryGenerator(details.getColumnType()); + String[] timeformats = meta.timeFormat.split(","); + for(String timeformat:timeformats){ +if(timeformat.startsWith(details.getColumnName())){ + timeformat = timeformat.replaceFirst(":", + CarbonCommonConstants.COLON_SPC_CHARACTER); + tuple = timeformat.replace(details.getColumnName(), tuple); +} + } --- End diff -- Better not to modify the tuple value.
[GitHub] incubator-carbondata pull request #219: [CARBONDATA-37]Support different tim...
Github user lion-x commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/219#discussion_r82722044 --- Diff: core/src/main/java/org/apache/carbondata/core/keygenerator/directdictionary/timestamp/TimeStampDirectDictionaryGenerator.java --- @@ -117,15 +117,24 @@ private TimeStampDirectDictionaryGenerator() { * @return dictionary value */ @Override public int generateDirectSurrogateKey(String memberStr) { -SimpleDateFormat timeParser = new SimpleDateFormat(CarbonProperties.getInstance() -.getProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, -CarbonCommonConstants.CARBON_TIMESTAMP_DEFAULT_FORMAT)); +String timeString; +String formatString; +if (memberStr.contains(CarbonCommonConstants.COLON_SPC_CHARACTER)){ --- End diff -- Because some values, such as XXXX-XX-XX 00:00:00.000, contain colons themselves, and splitting the member string on a plain colon would then cut it in the wrong place. For example, a member string like 2016-08-11 00:00:00.000:yyyy-MM-dd HH.mm.ss.SSS
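The splitting problem raised in this comment can be shown in isolation. In the sketch below the separator `#:#` is a made-up stand-in (the actual value of CarbonCommonConstants.COLON_SPC_CHARACTER is not shown in this thread), the `yyyy` in the format half is assumed, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class MemberFormatSplit {
  /** ASSUMPTION: placeholder separator, not the real COLON_SPC_CHARACTER value. */
  static final String SEPARATOR = "#:#";

  /** Naive approach: cut at the first plain colon. */
  static String[] splitAtFirstColon(String member) {
    return splitAt(member, ":");
  }

  /** Safer approach: cut at a separator that cannot occur inside a timestamp. */
  static String[] splitAtSeparator(String member) {
    return splitAt(member, SEPARATOR);
  }

  private static String[] splitAt(String member, String separator) {
    int index = member.indexOf(separator);
    if (index < 0) {
      throw new IllegalArgumentException("separator not found in: " + member);
    }
    return new String[] {
        member.substring(0, index), member.substring(index + separator.length())};
  }

  public static void main(String[] args) {
    String value = "2016-08-11 00:00:00.000";
    String format = "yyyy-MM-dd HH.mm.ss.SSS";
    // The first colon belongs to the time of day, so the value is cut in the middle.
    System.out.println(Arrays.toString(splitAtFirstColon(value + ":" + format)));
    // The dedicated separator yields the intended value and format halves.
    System.out.println(Arrays.toString(splitAtSeparator(value + SEPARATOR + format)));
  }
}
```

Passing the value and its format as two separate arguments, as the other review comments on this pull request suggest, would avoid the need for any in-band separator.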
[ANNOUNCE] Apache CarbonData 0.1.1-incubating Release
Hi, The Apache CarbonData team would like to announce the release of Apache CarbonData 0.1.1-incubating. Apache CarbonData(incubating) is a new big data file format for faster interactive query using advanced columnar storage, index, compression and encoding techniques to improve computing efficiency. The release artifacts can be downloaded here: https://dist.apache.org/repos/dist/release/incubator/carbondata/0.1.1-incubating/ Maven artifacts have been made available here: https://repository.apache.org/content/repositories/releases/org/apache/carbondata The release notes can be found here: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12338021 Regards Liang
[GitHub] incubator-carbondata pull request #224: [CARBONDATA-239]Add scan_blocklet_nu...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/224#discussion_r82729006 --- Diff: core/src/main/java/org/apache/carbondata/scan/processor/AbstractDataBlockIterator.java --- @@ -127,11 +133,15 @@ protected boolean updateScanner() { } } - private AbstractScannedResult getNextScannedResult() throws QueryExecutionException { + private AbstractScannedResult getNextScannedResult(QueryStatisticsRecorder recorder, --- End diff -- Why do we need to change the parameters of the getNextScannedResult() method? If it is required, please pass a statistics model; this will make sure that our method parameters won't grow.