[GitHub] incubator-carbondata pull request #313: [CARBONDATA-405]Fixed Data load fail...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/313#discussion_r8753

    --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataframe/DataFrameTestCase.scala ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.spark.testsuite.dataframe
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode}
    +import org.apache.spark.sql.common.util.CarbonHiveContext._
    +import org.apache.spark.sql.common.util.{CarbonHiveContext, QueryTest}
    +import org.scalatest.BeforeAndAfterAll
    +
    +/**
    + * Test Class for hadoop fs relation
    --- End diff --

    ok

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
Re: GC problem and performance refine problem
Hi Kumar Vishal,

1. Create table DDL:

    CREATE TABLE IF NOT EXISTS Table1 (
      *h Int, g Int, d String, f Int, e Int,*
      a Int,
      b Int,
      …(nearly 300 more columns)
    )
    STORED BY 'org.apache.carbondata.format'
    TBLPROPERTIES(
      "NO_INVERTED_INDEX"="a",
      "NO_INVERTED_INDEX"="b",
      …(nearly 300 more columns)
      "DICTIONARY_INCLUDE"="a",
      "DICTIONARY_INCLUDE"="b",
      …(nearly 300 more columns)
    )

2.
3. There are hundreds of nodes in the cluster, but the cluster is shared with other applications. Sometimes, when enough nodes are free, we get 100 distinct nodes.
4. Below are the task-time statistics for a single query, with the distinct nodes marked:

[image: inline image 1]

2016-11-10 23:52 GMT+08:00 Kumar Vishal:

> Hi Anning Luo,
>
> Can you please provide the details below.
>
> 1. Create table DDL.
> 2. Number of nodes in your cluster setup.
> 3. Number of executors per node.
> 4. Query statistics.
>
> Please find my comments in bold.
>
> Problem:
> 1. GC problem. We suffer 20%~30% GC time for some tasks in the first
> stage, even after a lot of parameter tuning. We now use G1 GC on Java 8.
> GC time doubles if we use CMS. Most of the GC time is spent on
> young-generation GC. Almost half of the young generation's memory is
> copied to the old generation. It seems that many objects live longer than
> one GC period and their space is not reused (as the concurrent GC releases
> it later). When we use a large Eden (>=1G, for example), a single GC takes
> seconds. If we set Eden small (256M, for example), a single GC takes
> hundreds of milliseconds, but GCs are more frequent and the total is still
> seconds. Is there any way to lessen the GC time? (We don't consider the
> first query and second query in this case.)
>
> *How many nodes are present in your cluster setup? If there are few nodes,
> please reduce the number of executors per node.*
>
> 2. Performance tuning problem. The number of rows left after filtering is
> not uniform. Some nodes may be heavily loaded and spend more time than
> other nodes. The time of one task is 4s ~ 16s. Is there any method to
> improve it?
>
> 3. The first and second queries take too long. I know the dictionary and
> some indexes need to be loaded the first time. But after I tried the query
> below to preheat it, it still spends a lot of time. How could I preheat
> the query correctly?
>     select Aarray, a, b, c… from Table1 where Aarray is not null
>     and d = "sss" and e != 22 and f = 33 and g = 44 and h = 55
>
> *Currently we are working on first time query improvement. For now you can
> run select count(*) or count(column), so all the blocks get loaded and then
> you can run the actual query.*
>
> 4. Any other suggestions to lessen the query time?
>
> Some suggestions:
> The log written by class QueryStatisticsRecorder gives me a good means to
> find the bottleneck, but it is not enough. There are some further metrics
> I think would be very useful:
> 1. Filter ratio, i.e. not only result_size but also the original size, so
> we could know how much data is filtered.
> 2. IO time. The scan_blocks_time is not enough. If it is high, we know
> something is wrong, but not what causes the problem. The real IO time for
> data is not provided. As there may be several files for one partition,
> knowing whether the slowness is caused by the datanode or by the executor
> itself would give us intuition to find the problem.
> 3. The TableBlockInfo for a task. I log it myself when debugging. It tells
> me how many blocklets are local. The Spark web monitor just gives a
> locality level, but maybe only one blocklet is local.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 10, 2016 at 8:55 PM, An Lan wrote:
>
> > Hi,
> >
> > We are using carbondata to build our table and running queries in
> > CarbonContext. We have some performance problems while tuning the
> > system.
> >
> > *Background*:
> >
> > *cluster*: 100 executors, 5 tasks/executor, 10G memory/executor
> >
> > *data*: 60+GB (per replica) in carbon data format, 600+MB/file * 100
> > files, 300+ columns, 300+ million rows
> >
> > *sql example:*
> >
> >     select A,
> >       sum(a),
> >       sum(b),
> >       sum(c),
> >       …(about 100 more aggregations like sum(column))
> >     from Table1 LATERAL VIEW explode(split(Aarray, ';')) ATable AS A
> >     where A is not null and d > "ab:c-10" and d < "h:0f3s"
> >       and e != 10 and f = 22 and g = 33 and h = 44
> >     GROUP BY A
> >
> > *target query time*: <10s
> >
> > *current query time*: 15s ~ 25s
> >
> > *scene:* OLAP system. <100 queries every day. Concurrency number is
Re: [VOTE] Apache CarbonData 0.2.0-incubating release
+1 (binding)

Regards
JB

On 11/10/2016 12:17 AM, Liang Chen wrote:

Hi all,

I submit the CarbonData 0.2.0-incubating to your vote.

Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12337896

Staging Repository:
https://repository.apache.org/content/repositories/orgapachecarbondata-1006

Git Tag:
carbondata-0.2.0-incubating

Please vote to approve this release:
[ ] +1 Approve the release
[ ] -1 Don't approve the release (please provide specific comments)

This vote will be open for at least 72 hours. If this vote passes (we need
at least 3 binding votes, meaning three votes from the PPMC), I will
forward to gene...@incubator.apache.org for the IPMC votes.

Here is my vote : +1 (binding)

Regards
Liang

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
[GitHub] incubator-carbondata pull request #305: [CARBONDATA-393] implement test case...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/305#discussion_r87540775

    --- Diff: core/src/test/java/org/apache/carbondata/core/keygenerator/mdkey/NumberCompressorUnitTest.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.core.keygenerator.mdkey;
    +
    +import org.junit.Test;
    +
    +import static junit.framework.Assert.assertEquals;
    +
    +
    +public class NumberCompressorUnitTest {
    +
    +    private NumberCompressor numberCompressor;
    +
    +
    +    @Test
    +    public void testCompress() throws Exception {
    +        int cardinality = 10;
    +        numberCompressor = new NumberCompressor(cardinality);
    +        byte[] expected = new byte[]{2, 86, 115};
    +        int[] keys = new int[]{2, 5, 6, 7, 3};
    +        byte[] result = numberCompressor.compress(keys);
    +        for (int i = 0; i < result.length; i++) {
    +            assertEquals(expected[i], result[i]);
    +        }
    +    }
    +
    +    @Test
    --- End diff --

    Test with boundary and negative conditions
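The review comment above asks for boundary and negative cases. As a rough, self-contained sketch of what such cases might look like, the class below re-implements a simple right-aligned bit packer. `BitPackSketch`, `bitsFor` and `pack` are hypothetical names used only for illustration, and this is not CarbonData's `NumberCompressor`. The only fact taken from the quoted diff is that keys {2, 5, 6, 7, 3} with cardinality 10 are expected to produce {2, 86, 115}. How the real class treats negative or out-of-range keys is not known from this thread; rejecting them here is an assumption.

```java
// Hypothetical stand-in for illustration only; NOT CarbonData's NumberCompressor.
final class BitPackSketch {

    // Bits needed to store any key in [0, cardinality]; never fewer than 1.
    static int bitsFor(int cardinality) {
        if (cardinality < 0) {
            throw new IllegalArgumentException("cardinality must be >= 0");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(cardinality));
    }

    // Packs keys back to back, right-aligned, so any unused pad bits sit at the front.
    static byte[] pack(int[] keys, int cardinality) {
        int bits = bitsFor(cardinality);
        int numBytes = (bits * keys.length + 7) / 8;
        byte[] out = new byte[numBytes];
        int bitPos = numBytes * 8 - 1; // bit 0 is the most significant bit of out[0]
        for (int k = keys.length - 1; k >= 0; k--) {
            int key = keys[k];
            // Assumption: out-of-range keys are rejected rather than silently truncated.
            if (key < 0 || key > cardinality) {
                throw new IllegalArgumentException("key out of range: " + key);
            }
            for (int b = 0; b < bits; b++, bitPos--) {
                if (((key >>> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (1 << (7 - bitPos % 8));
                }
            }
        }
        return out;
    }
}
```

Cases worth covering in this style include an empty key array, a key equal to the cardinality, the largest representable cardinality, and a negative key. A real test would call `NumberCompressor` itself and assert whatever behaviour the project decides on for those inputs.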
[GitHub] incubator-carbondata pull request #262: [CARBONDATA-308] Use CarbonInputForm...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/262#discussion_r87540836

    --- Diff: processing/src/main/java/org/apache/carbondata/lcm/status/SegmentStatusManager.java ---
    @@ -177,6 +178,13 @@ public ValidAndInvalidSegmentsInfo getValidAndInvalidSegments() throws IOExcepti
               }
             }
    +
    +        // remove entry in the segment index if there are invalid segments
    +        if (listOfInvalidSegments.size() > 0) {
    --- End diff --

    ok, modified
[GitHub] incubator-carbondata pull request #262: [CARBONDATA-308] Use CarbonInputForm...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/262#discussion_r87540816

    --- Diff: processing/src/main/java/org/apache/carbondata/lcm/status/SegmentStatusManager.java ---
    @@ -177,6 +178,13 @@ public ValidAndInvalidSegmentsInfo getValidAndInvalidSegments() throws IOExcepti
               }
             }
    +
    +        // remove entry in the segment index if there are invalid segments
    +        if (listOfInvalidSegments.size() > 0) {
    --- End diff --

    ok, modified
[GitHub] incubator-carbondata pull request #305: [CARBONDATA-393] implement test case...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/305#discussion_r87540550

    --- Diff: core/src/test/java/org/apache/carbondata/core/keygenerator/mdkey/BitsUnitTest.java ---
    @@ -0,0 +1,98 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.core.keygenerator.mdkey;
    +
    +
    +import org.junit.Test;
    +import static org.hamcrest.CoreMatchers.equalTo;
    +import static org.hamcrest.MatcherAssert.assertThat;
    +import static org.hamcrest.core.Is.is;
    +
    +public class BitsUnitTest {
    +    private Bits bits;
    +
    +    @Test
    +    public void testGetKeyByteOffsets() throws Exception {
    +        int[] lens = new int[]{1, 2, 3};
    +        bits = new Bits(lens);
    +        int index = 2;
    +        int[] expected = new int[]{0, 0};
    +        int[] result = bits.getKeyByteOffsets(index);
    +        assertThat(result, is(equalTo(expected)));
    +    }
    +
    +    @Test
    +    public void testGetWithIntKeys() throws Exception {
    +        int[] lens = new int[]{20, 35, 10};
    +        bits = new Bits(lens);
    +        long[] expected = new long[]{703687441812490L, 0};
    --- End diff --

    test with negative and boundary conditions
[GitHub] incubator-carbondata pull request #305: [CARBONDATA-393] implement test case...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/305#discussion_r87540505

    --- Diff: core/src/test/java/org/apache/carbondata/core/keygenerator/mdkey/BitsUnitTest.java ---
    @@ -0,0 +1,98 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.core.keygenerator.mdkey;
    +
    +
    +import org.junit.Test;
    +import static org.hamcrest.CoreMatchers.equalTo;
    +import static org.hamcrest.MatcherAssert.assertThat;
    +import static org.hamcrest.core.Is.is;
    +
    +public class BitsUnitTest {
    +    private Bits bits;
    +
    +    @Test
    +    public void testGetKeyByteOffsets() throws Exception {
    +        int[] lens = new int[]{1, 2, 3};
    --- End diff --

    Add more testcases with big values and also cover the boundary conditions in test cases.
[GitHub] incubator-carbondata pull request #305: [CARBONDATA-393] implement test case...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/305#discussion_r87540197

    --- Diff: core/src/test/java/org/apache/carbondata/core/keygenerator/columnar/impl/MultiDimKeyVarLengthVariableSplitGeneratorUnitTest.java ---
    @@ -0,0 +1,84 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.core.keygenerator.columnar.impl;
    +
    +
    +import org.junit.Before;
    +import org.junit.Test;
    +import static junit.framework.Assert.assertEquals;
    +import java.util.Arrays;
    +
    +public class MultiDimKeyVarLengthVariableSplitGeneratorUnitTest {
    +
    +    private MultiDimKeyVarLengthVariableSplitGenerator multiDimKeyVarLengthVariableSplitGenerator;
    +
    +    @Before
    +    public void setup() {
    +        int[] lens = new int[]{1, 2, 3, 4, 5, 7, 8, 9, 0, 9, 8, 7, 6, 5, 4, 3};
    +        int[] dimSplit = new int[]{50, 30};
    --- End diff --

    Here we should give proper `dimSplit` and add more testcases
[GitHub] incubator-carbondata pull request #303: [CARBONDATA-386] Unit test case for ...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/303#discussion_r87531522

    --- Diff: core/src/test/java/org/apache/carbondata/core/util/CarbonMetadataUtilTest.java ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.core.util;
    +
    +import mockit.Mock;
    +import mockit.MockUp;
    +import org.apache.carbondata.core.carbon.metadata.blocklet.index.BlockletBTreeIndex;
    +import org.apache.carbondata.core.carbon.metadata.blocklet.index.BlockletIndex;
    +import org.apache.carbondata.core.carbon.metadata.blocklet.index.BlockletMinMaxIndex;
    +import org.apache.carbondata.core.carbon.metadata.index.BlockIndexInfo;
    +import org.apache.carbondata.core.metadata.BlockletInfoColumnar;
    +import org.apache.carbondata.format.BlockIndex;
    +import org.apache.carbondata.format.ColumnSchema;
    +import org.apache.carbondata.format.IndexHeader;
    +import org.apache.carbondata.format.SegmentInfo;
    +import org.junit.Test;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import static junit.framework.TestCase.*;
    +import static org.apache.carbondata.core.util.CarbonMetadataUtil.getBlockIndexInfo;
    +import static org.apache.carbondata.core.util.CarbonMetadataUtil.getIndexHeader;
    +
    +public class CarbonMetadataUtilTest {
    +
    --- End diff --

    There are many methods to cover in CarbonMetadataUtil
[GitHub] incubator-carbondata pull request #295: [Carbondata-379] Scan package's unit...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/295#discussion_r87529444

    --- Diff: core/src/test/java/org/apache/carbondata/scan/result/impl/NonFilterQueryScannedResultTest.java ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.carbondata.scan.result.impl;
    +
    +import mockit.Mock;
    +import mockit.MockUp;
    +import org.apache.carbondata.scan.executor.infos.BlockExecutionInfo;
    +import org.apache.carbondata.scan.model.QueryDimension;
    +import org.apache.carbondata.scan.result.AbstractScannedResult;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +public class NonFilterQueryScannedResultTest {
    +    private static NonFilterQueryScannedResult nonFilterQueryScannedResult;
    +
    +    @Before
    +    public void setUp(){
    +        BlockExecutionInfo blockExecutionInfo = new BlockExecutionInfo();
    +        QueryDimension queryDimension[] = {new QueryDimension("dummyColumnName1"),new QueryDimension("dummyColumnName2")};
    +        blockExecutionInfo.setQueryDimensions(queryDimension);
    +        nonFilterQueryScannedResult = new NonFilterQueryScannedResult(blockExecutionInfo);
    +
    +    }
    +
    +    @Test
    +    public void testIsNullMeasureValue(){
    --- End diff --

    This test is doing nothing, please mock it properly. We are supposed to set `measureDataChunks` and call this method.
Re: As planed, we are ready to make Apache CarbonData 0.2.0 release:
+1

Regards
bill.zhou

Liang Chen wrote
> Hi all
>
> In 0.2.0 version of CarbonData, there are major performance improvements
> like blocklets distribution, support BZIP2 compressed files, and so on
> added to enhance the CarbonData performance significantly. Along with
> performance improvement, there are new features added to enhance
> compatibility and usability of CarbonData like remove thrift compiler
> dependency.
>
> I can be this release manager, can JB guide me to finish this release?
>
> Thanks.
>
> Regards
> Liang

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/As-planed-we-are-ready-to-make-Apache-CarbonData-0-2-0-release-tp2738p2861.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Re: [VOTE] Apache CarbonData 0.2.0-incubating release
+1 binding

On 11/11/2016 02:37, Venkata Gollamudi wrote:
+1

Regards,
Ramana

On Thu, Nov 10, 2016, 10:03 PM Jacky Li wrote:

> +1 binding
>
> Regards,
> Jacky
>
> ---Original---
> From: "Aniket Adnaik"
> Date: 2016/11/10 14:43:49
> To: "dev" ;"chenliang613" chenliang...@apache.org;
> Subject: Re: [VOTE] Apache CarbonData 0.2.0-incubating release
>
> +1
>
> Regards,
> Aniket
>
> On 9 Nov 2016 3:17 p.m., "Liang Chen" wrote:
>
> > Hi all,
> >
> > I submit the CarbonData 0.2.0-incubating to your vote.
> >
> > Release Notes:
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > projectId=12320220=12337896
> >
> > Staging Repository:
> > https://repository.apache.org/content/repositories/
> > orgapachecarbondata-1006
> >
> > Git Tag:
> > carbondata-0.2.0-incubating
> >
> > Please vote to approve this release:
> > [ ] +1 Approve the release
> > [ ] -1 Don't approve the release (please provide specific comments)
> >
> > This vote will be open for at least 72 hours. If this vote passes (we
> > need at least 3 binding votes, meaning three votes from the PPMC), I
> > will forward to gene...@incubator.apache.org for the IPMC votes.
> >
> > Here is my vote : +1 (binding)
> >
> > Regards
> > Liang
[GitHub] incubator-carbondata pull request #295: [Carbondata-379] Scan package's unit...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/295#discussion_r87527367

    --- Diff: core/src/main/java/org/apache/carbondata/scan/complextypes/PrimitiveQueryType.java ---
    @@ -166,6 +166,7 @@ public PrimitiveQueryType(String name, String parentname, int blockIndex,
               DirectDictionaryGenerator directDictionaryGenerator = DirectDictionaryKeyGeneratorFactory
                   .getDirectDictionaryGenerator(dataType);
               actualData = directDictionaryGenerator.getValueFromSurrogate(surrgateValue);
    +
    --- End diff --

    Please don't add space
[GitHub] incubator-carbondata pull request #277: [CARBONDATA-357] Add unit test for V...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/277#discussion_r87527053 --- Diff: core/src/test/java/org/apache/carbondata/core/util/ValueCompressionUtilTest.java --- @@ -0,0 +1,546 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.carbondata.core.util; + +import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder; +import org.apache.carbondata.core.datastorage.store.compression.type.*; +import org.junit.Test; + +import java.nio.ByteBuffer; + +import static junit.framework.TestCase.*; +import static org.apache.carbondata.core.util.ValueCompressionUtil.DataType; + +public class ValueCompressionUtilTest { + +@Test +public void testGetSize() { +DataType[] dataTypes = {DataType.DATA_BIGINT,DataType.DATA_INT,DataType.DATA_BYTE,DataType.DATA_SHORT,DataType.DATA_FLOAT}; +int[] expectedSizes = {8,4,1,2,4}; +for(int i =0; i < dataTypes.length; i++) { + assertEquals(expectedSizes[i],ValueCompressionUtil.getSize(dataTypes[i])); +} +} + +@Test +public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataInt() { +double[] values = {20.121,21.223,22.345}; +int[] result = (int[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_INT,22.3,3); +int[] expectedResult = {2,1,0}; +for(int i=0; i < values.length; i++) { +assertEquals(result[i], expectedResult[i]); +} +} + +@Test +public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataByte() { +double[] values = {20.121,21.223,22.345}; +byte[] result = (byte[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_BYTE,22.345,3); +byte[] expectedResult = {2,1,0}; +for(int i=0; i < values.length; i++) { +assertEquals(result[i], expectedResult[i]); +} +} + +@Test +public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataShort() { +double[] values = {200.121,21.223,22.345}; +short[] result = (short[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_SHORT,22.345,3); +short[] expectedResult = {-177,1,0}; +for(int i=0; i < values.length; i++) { +assertEquals(result[i], expectedResult[i]); +} +} + +@Test +public 
void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataLong() { +double[] values = {20.121,21.223,22.345}; +long[] result = (long[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_LONG,22.345,3); +long[] expectedResult = {2,1,0}; +for(int i=0; i < values.length; i++) { +assertEquals(result[i], expectedResult[i]); +} +} + +@Test +public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataFloat() { +double[] values = {20.121,21.223,22.345}; +float[] result = (float[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_FLOAT,22.345,3); +float[] expectedResult = {2.224f,1.122f,0f}; +for(int i=0; i < values.length; i++) { +assertEquals(result[i], expectedResult[i]); +} +} + +@Test +public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataDouble() { +double[] values = {20.121,21.223,22.345}; +double[] result = (double[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_DOUBLE,102.345,3); +
[GitHub] incubator-carbondata pull request #277: [CARBONDATA-357] Add unit test for V...
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/277#discussion_r87526104

    --- Diff: core/src/test/java/org/apache/carbondata/core/util/ValueCompressionUtilTest.java ---
    @@ -0,0 +1,546 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.carbondata.core.util;
    +
    +import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder;
    +import org.apache.carbondata.core.datastorage.store.compression.type.*;
    +import org.junit.Test;
    +
    +import java.nio.ByteBuffer;
    +
    +import static junit.framework.TestCase.*;
    +import static org.apache.carbondata.core.util.ValueCompressionUtil.DataType;
    +
    +public class ValueCompressionUtilTest {
    +
    +    @Test
    +    public void testGetSize() {
    +        DataType[] dataTypes = {DataType.DATA_BIGINT,DataType.DATA_INT,DataType.DATA_BYTE,DataType.DATA_SHORT,DataType.DATA_FLOAT};
    +        int[] expectedSizes = {8,4,1,2,4};
    +        for(int i =0; i < dataTypes.length; i++) {
    +            assertEquals(expectedSizes[i],ValueCompressionUtil.getSize(dataTypes[i]));
    +        }
    +    }
    +
    +    @Test
    +    public void testToGetCompressedValuesWithCompressionTypeMin_MaxForDataInt() {
    +        double[] values = {20.121,21.223,22.345};
    +        int[] result = (int[]) ValueCompressionUtil.getCompressedValues(ValueCompressionUtil.COMPRESSION_TYPE.MAX_MIN,values,DataType.DATA_INT,22.3,3);
    --- End diff --

    The values passed are wrong: you are passing decimal values with the compression type `MAX_MIN`, which does not return the right result. Please pass proper values depending on the compression type. Please change this for the other datatypes as well.
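To make the point of the review comment concrete, here is a small sketch of max-min style delta encoding. It assumes, based only on the expected values in the quoted test ({2, 1, 0} for values {20.121, 21.223, 22.345}), that the encoding stores `max - value` truncated to the target integer type. `MaxMinSketch` and its methods are hypothetical names, not the `ValueCompressionUtil` API. Under that assumption, integral inputs survive a round trip exactly while fractional inputs lose their fraction, which would explain why decimal test data is a poor fit for this compression type.

```java
import java.util.Arrays;

// Hypothetical illustration only; NOT CarbonData's ValueCompressionUtil.
final class MaxMinSketch {

    // Encodes each value as its distance from the maximum, truncated to an int.
    static int[] compress(double[] values, double max) {
        int[] deltas = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            deltas[i] = (int) (max - values[i]);
        }
        return deltas;
    }

    // Reverses compress(); any fraction dropped by the int cast cannot be recovered.
    static double[] decompress(int[] deltas, double max) {
        double[] values = new double[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            values[i] = max - deltas[i];
        }
        return values;
    }

    // True only when compress followed by decompress reproduces the input exactly.
    static boolean roundTrips(double[] values, double max) {
        return Arrays.equals(values, decompress(compress(values, max), max));
    }
}
```

With this sketch, `roundTrips(new double[]{20, 21, 22}, 22)` holds, while the decimal inputs from the quoted test do not round-trip. A test for the real utility would presumably choose integral inputs for `MAX_MIN` and keep decimal inputs for whichever compression types are meant to handle them.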
[GitHub] incubator-carbondata pull request #313: [CARBONDATA-405]Fixed Data load fail...
Github user Jay357089 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/313#discussion_r87524058 --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataframe/DataFrameTestCase.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.carbondata.spark.testsuite.dataframe + +import java.io.File + +import org.apache.spark.sql.{DataFrame, Row, SaveMode} +import org.apache.spark.sql.common.util.CarbonHiveContext._ +import org.apache.spark.sql.common.util.{CarbonHiveContext, QueryTest} +import org.scalatest.BeforeAndAfterAll + +/** + * Test Class for hadoop fs relation --- End diff -- This class comment is not appropriate for this test case.
Re: [VOTE] Apache CarbonData 0.2.0-incubating release
+1 binding jinzhu On 2016/11/11 0:33, Jacky Li wrote: +1 binding Regards, Jacky ---Original--- From: "Aniket Adnaik" Date: 2016/11/10 14:43:49 To: "dev"; "chenliang613"; Subject: Re: [VOTE] Apache CarbonData 0.2.0-incubating release +1 Regards, Aniket On 9 Nov 2016 3:17 p.m., "Liang Chen" wrote: Hi all, I submit the CarbonData 0.2.0-incubating to your vote. Release Notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa? projectId=12320220=12337896 Staging Repository: https://repository.apache.org/content/repositories/orgapachecarbondata-1006 Git Tag: carbondata-0.2.0-incubating Please vote to approve this release: [ ] +1 Approve the release [ ] -1 Don't approve the release (please provide specific comments) This vote will be open for at least 72 hours. If this vote passes (we need at least 3 binding votes, meaning three votes from the PPMC), I will forward to gene...@incubator.apache.org for the IPMC votes. Here is my vote : +1 (binding) Regards Liang
Re: As planed, we are ready to make Apache CarbonData 0.2.0 release:
+1 Regards, Ramana On Thu, Nov 10, 2016, 6:03 AM foryou2030 wrote: > +1 > regards > Gin > > Sent from my iPhone > > > On Nov 10, 2016, at 3:25 AM, Kumar Vishal wrote: > > > > +1 > > -Regards > > Kumar Vishal > > > >> On Nov 9, 2016 08:04, "Jacky Li" wrote: > >> > >> +1 > >> > >> Regards, > >> Jacky > >> > >>> On Nov 9, 2016, at 9:05 AM, Jay <2550062...@qq.com> wrote: > >>> > >>> +1 > >>> regards > >>> Jay > >>> > >>> -- Original Message -- > >>> From: "向志强"; ; > >>> Sent: Wednesday, Nov 9, 2016, 8:59 AM > >>> To: "dev" ; > >>> > >>> Subject: Re: As planed, we are ready to make Apache CarbonData 0.2.0 > release: > >>> > >>> It is great that there is no need to install Thrift to build the project. > >>> > >>> 2016-11-08 23:16 GMT+08:00 QiangCai : > >>> > I look forward to the release of this version. > CarbonData has improved query and load performance, and it is good news that there is no > need to install Thrift to build the project. > By the way, how many PRs were merged into this version? > > > > -- > View this message in context: http://apache-carbondata- > mailing-list-archive.1130556.n5.nabble.com/As-planed-we- > are-ready-to-make-Apache-CarbonData-0-2-0-release-tp2738p2752.html > Sent from the Apache CarbonData Mailing List archive mailing list > >> archive > at Nabble.com. > > > >
[GitHub] incubator-carbondata pull request #270: [CARBONDATA-346] Add unit test for C...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/270#discussion_r87453971 --- Diff: core/src/test/java/org/apache/carbondata/core/util/CarbonUtilTest.java --- @@ -18,18 +18,746 @@ */ package org.apache.carbondata.core.util; -import junit.framework.TestCase; +import mockit.Mock; +import mockit.MockUp; +import org.apache.carbondata.core.carbon.datastore.chunk.DimensionChunkAttributes; +import org.apache.carbondata.core.carbon.datastore.chunk.impl.FixedLengthDimensionDataChunk; +import org.apache.carbondata.core.carbon.metadata.blocklet.DataFileFooter; +import org.apache.carbondata.core.carbon.metadata.blocklet.datachunk.DataChunk; +import org.apache.carbondata.core.carbon.metadata.datatype.DataType; +import org.apache.carbondata.core.carbon.metadata.encoder.Encoding; +import org.apache.carbondata.core.carbon.metadata.schema.table.column.CarbonDimension; +import org.apache.carbondata.core.carbon.metadata.schema.table.column.CarbonMeasure; +import org.apache.carbondata.core.carbon.metadata.schema.table.column.ColumnSchema; +import org.apache.carbondata.core.datastorage.store.columnar.ColumnGroupModel; +import org.apache.carbondata.core.datastorage.store.compression.ValueCompressionModel; +import org.apache.carbondata.core.datastorage.store.filesystem.LocalCarbonFile; +import org.apache.carbondata.core.datastorage.store.impl.FileFactory; +import org.apache.carbondata.core.keygenerator.mdkey.NumberCompressor; +import org.apache.carbondata.core.metadata.ValueEncoderMeta; +import org.apache.carbondata.scan.model.QueryDimension; +import org.apache.hadoop.security.UserGroupInformation; +import org.glassfish.grizzly.memory.HeapBuffer; +import org.junit.AfterClass; +import org.junit.BeforeClass; import org.junit.Test; +import org.pentaho.di.core.exception.KettleException; +import java.io.*; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.util.ArrayList; +import java.util.List; 
+import static junit.framework.TestCase.*; -public class CarbonUtilTest extends TestCase { +public class CarbonUtilTest { - @Test public void testGetBitLengthForDimensionGiveProperValue() { -int[] cardinality = { 10, 1, 1, 1, 2, 3 }; -int[] dimensionBitLength = -CarbonUtil.getDimensionBitLength(cardinality, new int[] { 1, 1, 3, 1 }); -int[] expectedOutPut = { 8, 8, 14, 2, 8, 8 }; -for (int i = 0; i < dimensionBitLength.length; i++) { - assertEquals(expectedOutPut[i], dimensionBitLength[i]); +@BeforeClass +public static void setUp() throws Exception{ +new File("../core/src/test/resources/testFile.txt").createNewFile(); +new File("../core/src/test/resources/testDatabase").mkdirs(); + +} + +@Test +public void testGetBitLengthForDimensionGiveProperValue() { +int[] cardinality = {200, 1, 1, 1, 10, 3}; +int[] dimensionBitLength = +CarbonUtil.getDimensionBitLength(cardinality, new int[]{1, 1, 3, 1}); +int[] expectedOutPut = {8, 8, 14, 2, 8, 8}; +for (int i = 0; i < dimensionBitLength.length; i++) { +assertEquals(expectedOutPut[i], dimensionBitLength[i]); +} +} + +@Test(expected = IOException.class) +public void testCloseStreams() throws IOException { +FileReader stream = new FileReader("../core/src/test/resources/testFile.txt"); +BufferedReader br = new BufferedReader(stream); +CarbonUtil.closeStreams(br); +br.ready(); +} + +@Test +public void testToGetCardinality() { +int result = CarbonUtil.getIncrementedCardinality(10); --- End diff -- Please add more checks here by passing several different values.
[GitHub] incubator-carbondata pull request #269: [CARBONDATA-345] improve code-covera...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/269#discussion_r87446559 --- Diff: processing/src/test/java/org/apache/carbondata/lcm/locks/ZooKeeperLockingTest.java --- @@ -41,103 +41,103 @@ */ public class ZooKeeperLockingTest { --- End diff -- Why was this test case modified?
[GitHub] incubator-carbondata pull request #269: [CARBONDATA-345] improve code-covera...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/269#discussion_r87446084 --- Diff: pom.xml --- @@ -6,9 +6,7 @@ The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at --- End diff -- Why is an update needed in the pom file?
[GitHub] incubator-carbondata pull request #269: [CARBONDATA-345] improve code-covera...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/269#discussion_r87444589 --- Diff: core/src/test/java/org/apache/carbondata/core/cache/dictionary/DictionaryByteArrayWrapperTest.java --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.carbondata.core.cache.dictionary; + +import org.junit.Before; +import org.junit.Test; + +public class DictionaryByteArrayWrapperTest { + +DictionaryByteArrayWrapper dictionaryByteArrayWrapper; + +@Before +public void setup() { +byte[] data = "Rahul".getBytes(); +dictionaryByteArrayWrapper = new DictionaryByteArrayWrapper(data); --- End diff -- Please also include a test for the other constructor, the one that takes xxHash32.
[GitHub] incubator-carbondata pull request #267: [CARBONDATA-340] implement test case...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/267#discussion_r87442533 --- Diff: core/src/main/java/org/apache/carbondata/core/load/LoadMetadataDetails.java --- @@ -150,7 +150,7 @@ public String getLoadStartTime() { * return loadStartTime * @return */ - public long getLoadStartTimeAsLong() { + public Long getLoadStartTimeAsLong() { --- End diff -- Why does the return type need to change to `Long`?
[GitHub] incubator-carbondata pull request #313: [CARBONDATA-405]Fixed Data load fail...
GitHub user ravipesala opened a pull request: https://github.com/apache/incubator-carbondata/pull/313 [CARBONDATA-405]Fixed Data load fail if dataframe is created with LONG datatype column If the DataFrame schema has a column of long datatype, carbon table creation fails because the long type is not converted to the supported bigint type. This PR fixes that. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ravipesala/incubator-carbondata dataframe-longtype-issue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/313.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #313 commit 78d9fe6f32fc010c6dd6115444872abd3b53338d Author: ravipesala Date: 2016-11-10T16:46:59Z Fixed Data load fail if dataframe is created with LONG datatype column
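The kind of mapping this PR description refers to can be sketched as follows. This is only an illustration under assumptions: the helper `toDdlType` and the class `DataFrameTypeMapping` are hypothetical names, the input strings are assumed to be lower-case DataFrame type names such as `long` or `integer`, and the actual change in the PR may look quite different.

```java
import java.util.Locale;

// Hypothetical helper, not the code from the pull request.
final class DataFrameTypeMapping {

    // Maps an assumed DataFrame type name to the type name used in CREATE TABLE DDL.
    // The case that matters for this issue is "long", which has to become BIGINT.
    static String toDdlType(String dataFrameTypeName) {
        switch (dataFrameTypeName.toLowerCase(Locale.ROOT)) {
            case "string":
                return "STRING";
            case "integer":
                return "INT";
            case "long":
                return "BIGINT";
            case "double":
                return "DOUBLE";
            case "timestamp":
                return "TIMESTAMP";
            default:
                // Fail loudly instead of emitting DDL with an unknown type.
                throw new IllegalArgumentException("Unsupported type: " + dataFrameTypeName);
        }
    }

    public static void main(String[] args) {
        System.out.println("long -> " + toDdlType("long"));
    }
}
```

Throwing on an unknown type name is a design choice of the sketch: it surfaces an unsupported column type before any table is created.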
Re: [VOTE] Apache CarbonData 0.2.0-incubating release
+1 binding Regards, Jacky ---Original--- From: "Aniket Adnaik"Date: 2016/11/10 14:43:49 To: "dev" ;"chenliang613" ; Subject: Re: [VOTE] Apache CarbonData 0.2.0-incubating release +1 Regards, Aniket On 9 Nov 2016 3:17 p.m., "Liang Chen" wrote: > Hi all, > > I submit the CarbonData 0.2.0-incubating to your vote. > > Release Notes: > https://issues.apache.org/jira/secure/ReleaseNote.jspa? > projectId=12320220=12337896 > > Staging Repository: > https://repository.apache.org/content/repositories/ > orgapachecarbondata-1006 > > Git Tag: > carbondata-0.2.0-incubating > > Please vote to approve this release: > [ ] +1 Approve the release > [ ] -1 Don't approve the release (please provide specific comments) > > This vote will be open for at least 72 hours. If this vote passes (we need > at least 3 binding votes, meaning three votes from the PPMC), I will > forward to gene...@incubator.apache.org for the IPMC votes. > > Here is my vote : +1 (binding) > > Regards > Liang >
[GitHub] incubator-carbondata pull request #296: [CARBONDATA-382]Like Filter Query Op...
Github user kumarvishal09 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/296#discussion_r87425199 --- Diff: core/src/main/java/org/apache/carbondata/scan/filter/FilterExpressionProcessor.java --- @@ -286,6 +289,13 @@ private FilterResolverIntf getFilterResolverBasedOnExpressionType( return new RowLevelFilterResolverImpl(expression, isExpressionResolve, true, tableIdentifier); } +if (currentCondExpression.getFilterExpressionType() == ExpressionType.CONTAINS --- End diff -- For a dictionary column, do we need to create a row-level expression? I think that for a dictionary column, creating an include filter for a LIKE query would be good enough: since we have the dictionary values, we can search the dictionary for all the valid values and apply the filter on them. Please correct me if I am wrong :)
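The suggestion in the comment above can be sketched like this. It is a minimal illustration, not CarbonData's filter resolver: the class `DictionaryLikeFilterSketch`, its methods, and the convention that a surrogate key is the 1-based position of a value in the dictionary are all assumptions made for the example. The pattern is matched once per distinct dictionary value, and the rows are then filtered by comparing integer keys only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and key convention are hypothetical.
final class DictionaryLikeFilterSketch {

    // Step 1: scan the dictionary once and collect the surrogate keys
    // (assumed here to be 1-based positions) of values containing the pattern.
    static Set<Integer> matchingKeys(List<String> dictionary, String needle) {
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < dictionary.size(); i++) {
            if (dictionary.get(i).contains(needle)) {
                keys.add(i + 1);
            }
        }
        return keys;
    }

    // Step 2: apply an include filter on the encoded column. No string
    // comparison happens per row, only a set lookup on the surrogate key.
    static List<Integer> includeFilter(int[] encodedColumn, Set<Integer> keys) {
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row < encodedColumn.length; row++) {
            if (keys.contains(encodedColumn[row])) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("apple", "grape", "pear", "pineapple");
        int[] encodedColumn = {1, 2, 4, 3, 4};
        Set<Integer> keys = matchingKeys(dictionary, "apple");
        System.out.println(includeFilter(encodedColumn, keys));
    }
}
```

The trade-off is that the cost of evaluating the pattern scales with the dictionary size rather than the row count, which helps when the column has far fewer distinct values than rows.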
[jira] [Created] (CARBONDATA-406) Empty Folder is created when data load from dataframe
Babulal created CARBONDATA-406:
--
Summary: Empty Folder is created when data load from dataframe
Key: CARBONDATA-406
URL: https://issues.apache.org/jira/browse/CARBONDATA-406
Project: CarbonData
Issue Type: Bug
Components: data-load
Affects Versions: 0.1.0-incubating
Reporter: Babulal
Priority: Trivial

Load data from a dataframe into a carbon table with the tempCSV=false option. The load succeeds, but an empty folder is created in HDFS.

Cluster size: 3 nodes. Type: Standalone Spark.

Steps:

val customSchema = StructType(Array(StructField("imei", StringType, true), StructField("deviceInformationId", IntegerType, true),StructField("mac", StringType, true),StructField("productdate", TimestampType , true), StructField("updatetime", TimestampType, true),StructField("gamePointId", DoubleType, true),StructField("contractNumber", DoubleType, true) ));

val df = cc.read.format("com.databricks.spark.csv").option("header", "false").schema(customSchema).load("/opt/data/xyz/100_default_date_11_header.csv");

Start data loading:

scala> df.write.format("carbondata").option("tableName","mycarbon2").save();

Check logs:

leges:{}, groupPrivileges:null, rolePrivileges:null))
INFO 10-11 23:52:44,005 - Creating directory if it doesn't exist: hdfs://10.18.102.236:54310/opt/Carbon/Spark/spark/bin/null/bin/carbonshellstore/hivemetadata/mycarbon4
AUDIT 10-11 23:52:44,037 - [BLR107781][root][Thread-1]Table created with Database name [default] and Table name [mycarbon4]
INFO 10-11 23:52:44,040 - Successfully able to get the table metadata file lock

In HDFS this path is empty: hdfs://10.18.102.236:54310/opt/Carbon/Spark/spark/bin/null/bin/carbonshellstore/hivemetadata/mycarbon4

Actual store location: hdfs://10.18.102.236:54310/opt/Carbon/mystore

Expected: the empty folder should not be created. It seems that it is created under SPARK_HOME/bin. SPARK_HOME is /opt/Carbon/Spark/spark/bin

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: GC problem and performance refine problem
Hi Anning Luo,

Can you please provide the details below?

1. Create table DDL.
2. Number of nodes in your cluster setup.
3. Number of executors per node.
4. Query statistics.

Please find my comments in bold.

Problem:

1. GC problem. Some tasks in the first stage spend 20%~30% of their time in GC, even after a lot of parameter tuning. We now use G1 GC on Java 8. GC time doubles if we use CMS. Most of the GC time is spent on young-generation GC. Almost half of the young generation's memory is copied to the old generation. It seems many objects live longer than one GC period and their space is not reused (as the concurrent GC will release it later). When we use a large Eden (>=1G for example), a single GC takes seconds. If we set a small Eden (256M for example), a single GC takes hundreds of milliseconds, but GCs are more frequent and the total is still seconds. Is there any way to reduce the GC time? (We don’t consider the first query and second query in this case.)

*How many nodes are present in your cluster setup? If there are only a few nodes, please reduce the number of executors per node.*

2. Performance refine problem. The number of rows left after filtering is not uniform. Some nodes may be heavily loaded and spend more time than other nodes. The time of one task is 4s ~ 16s. Is there any method to improve it?

3. The first and second queries take too long. I know the dictionary and some indexes need to be loaded the first time. But after I tried using the query below to preheat it, it still takes a lot of time. How can I preheat the query correctly?

select Aarray, a, b, c… from Table1 where Aarray is not null and d = “sss” and e !=22 and f = 33 and g = 44 and h = 55

*Currently we are working on improving first-time query performance. For now you can run select count(*) or count(column) so that all the blocks get loaded, and then run the actual query.*

4. Any other suggestions to reduce the query time?

Some suggestions:

The log from class QueryStatisticsRecorder gives me a good means of finding the bottleneck, but it is not enough. There are still some metrics I think would be very useful:

1. Filter ratio, i.e. not only result_size but also the origin size, so we could know how much data is filtered.

2. IO time. The scan_blocks_time is not enough. If it is high, we know something is wrong, but not what causes the problem. The real IO time for data is not provided. As there may be several files for one partition, knowing whether the slowness is caused by the datanode or by the executor itself would help us find the problem.

3. The TableBlockInfo for each task. I log it myself when debugging. It tells me how many blocklets are local. The Spark web monitor just gives a locality level, but maybe only one blocklet is local.

-Regards
Kumar Vishal

On Thu, Nov 10, 2016 at 8:55 PM, An Lan wrote:

> Hi,
>
> We are using CarbonData to build our table and run queries in CarbonContext. We have run into some performance problems while tuning the system.
>
> *Background*:
>
> *cluster*: 100 executors, 5 tasks/executor, 10 GB memory/executor
>
> *data*: 60+ GB (per replica) in carbon data format, 600+ MB/file * 100 files, 300+ columns, 300+ million rows
>
> *sql example:*
>
> select A,
> sum(a),
> sum(b),
> sum(c),
> … (about 100 more aggregations like sum(column))
> from Table1 LATERAL VIEW explode(split(Aarray, ‘*;*’)) ATable AS A
> where A is not null and d > “ab:c-10” and d < “h:0f3s” and e!=10 and f=22 and g=33 and h=44 GROUP BY A
>
> *target query time*: <10s
>
> *current query time*: 15s ~ 25s
>
> *scene:* OLAP system. Fewer than 100 queries every day. Concurrency is not high. The CPU is idle most of the time, so this service will run alongside other programs. The service will run for a long time. We cannot reserve a very large amount of memory for every executor.
>
> *refine*: I have built index and dictionary on d, e, f, g, h and built dictionary on all the other aggregation columns (i.e. a, b, c, … 100+ columns), and made sure there is one segment for the total data. I have enabled speculation (quantile=0.5, interval=250, multiplier=1.2).
>
> Time is mainly spent in the first stage, before shuffling. As 95% of the data is filtered out, the shuffle process takes little time. In the first stage, most tasks complete in less than 10s, but there are still nearly 50 tasks that take longer than 10s. The maximum task time in one query may be 12~16s.
>
> *Problem:*
>
> 1. GC problem. Some tasks in the first stage spend 20%~30% of their time in GC, even after a lot of parameter tuning. We now use G1 GC on Java 8. GC
[GitHub] incubator-carbondata pull request #312: [CARBONDATA-404] Fixing dataframe sa...
GitHub user ravipesala opened a pull request: https://github.com/apache/incubator-carbondata/pull/312 [CARBONDATA-404] Fixing dataframe save when loading in cluster mode. Currently, DataFrame save writes the temp CSV to a local folder, so it fails in cluster mode. This PR changes the temp CSV location to the store path. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ravipesala/incubator-carbondata dataframe-csv-issue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/312.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #312 commit ea94f9aeebe026c29c6c4f976ef268bb29517da7 Author: ravipesala Date: 2016-11-10T15:03:32Z Fixing dataframe save when loading in cluster mode.
[jira] [Created] (CARBONDATA-403) add example for data load without using kettle
Jacky Li created CARBONDATA-403: --- Summary: add example for data load without using kettle Key: CARBONDATA-403 URL: https://issues.apache.org/jira/browse/CARBONDATA-403 Project: CarbonData Issue Type: Improvement Reporter: Jacky Li add example for data load without using kettle
[GitHub] incubator-carbondata pull request #311: add example for data load without us...
GitHub user jackylk opened a pull request: https://github.com/apache/incubator-carbondata/pull/311 add example for data load without using kettle This PR adds example SQL and DataFrame usage for loading data without Kettle. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jackylk/incubator-carbondata no-kettle-example Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/311.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #311 commit 65d5dec6af97be79924a7c43a48c7a7baa540c7b Author: jackylk Date: 2016-11-10T15:20:52Z add no-kettle loading example
[jira] [Created] (CARBONDATA-404) Data loading from DataFrame to carbon table is FAILED
Babulal created CARBONDATA-404:
--
Summary: Data loading from DataFrame to carbon table is FAILED
Key: CARBONDATA-404
URL: https://issues.apache.org/jira/browse/CARBONDATA-404
Project: CarbonData
Issue Type: Bug
Components: data-load
Affects Versions: 0.1.0-incubating
Reporter: Babulal

Data loading fails when loading data from a DataFrame with the option tempCSV=true (the default) in a 3-node cluster.

Steps:

val customSchema = StructType(Array(StructField("imei", StringType, true), StructField("deviceInformationId", IntegerType, true),StructField("mac", StringType, true),StructField("productdate", TimestampType , true), StructField("updatetime", TimestampType, true),StructField("gamePointId", DoubleType, true),StructField("contractNumber", DoubleType, true) ));

val df = cc.read.format("com.databricks.spark.csv").option("header", "false").schema(customSchema).load("/opt/data/xyz/100_default_date_11_header.csv");

Start data loading:

scala> df.write.format("carbondata").option("tableName","mycarbon2").save();

INFO 10-11 23:24:35,970 - main Query [ CREATE TABLE IF NOT EXISTS DEFAULT.MYCARBON2 (IMEI STRING, DEVICEINFORMATIONID INT, MAC STRING, PRODUCTDATE TIMESTAMP, UPDATETIME TIMESTAMP, GAMEPOINTID DOUBLE, CONTRACTNUMBER DOUBLE) STORED BY 'ORG.APACHE.CARBONDATA.FORMAT' ]
INFO 10-11 23:24:35,977 - Parsing command: CREATE TABLE IF NOT EXISTS default.mycarbon2 (imei STRING, deviceInformationId INT, mac STRING, productdate TIMESTAMP, updatetime TIMESTAMP, gamePointId DOUBLE, contractNumber DOUBLE) STORED BY 'org.apache.carbondata.format'
INFO 10-11 23:24:35,978 - Parse Completed
INFO 10-11 23:24:36,227 - main Query [ LOAD DATA INPATH './TEMPCSV' INTO TABLE DEFAULT.MYCARBON2 OPTIONS ('FILEHEADER' = 'IMEI,DEVICEINFORMATIONID,MAC,PRODUCTDATE,UPDATETIME,GAMEPOINTID,CONTRACTNUMBER') ]
INFO 10-11 23:24:36,233 - Successfully able to get the table metadata file lock
AUDIT 10-11 23:24:36,234 - [BLR107781][root][Thread-1]Dataload failed for default.mycarbon2. The input file does not exist: ./tempCSV
INFO 10-11 23:24:36,234 - main Successfully deleted the lock file /tmp/default/mycarbon2/meta.lock
INFO 10-11 23:24:36,234 - Table MetaData Unlocked Successfully after data load
org.apache.carbondata.processing.etl.DataLoadingException: The input file does not exist: ./tempCSV
at org.apache.spark.util.FileUtils$$anonfun$getPaths$1.apply$mcVI$sp(FileUtils.scala:66)

CSV DATA

1AA1,1,Mikaa1,2015-01-01 11:00:00,2015-01-01 13:00:00,198,260
1AA2,3,Mikaa2,2015-01-02 12:00:00,2015-01-01 14:00:00,278,230
1AA3,1,Mikaa1,2015-01-03 13:00:00,2015-01-01 15:00:00,2556,1
1AA4,10,Mikaa2,2015-01-04 14:00:00,2015-01-01 16:00:00,640,254
1AA5,10,Mikaa,2015-01-05 15:00:00,2015-01-01 17:00:00,980,256
1AA6,10,Mikaa,2015-01-06 16:00:00,2015-01-01 18:00:00,1,2378
1AA7,10,Mikaa,2015-01-07 17:00:00,2015-01-01 19:00:00,96,234
1AA8,9,max,2015-01-08 18:00:00,2015-01-01 20:00:00,89,236
GC problem and performance refine problem
Hi,

We are using CarbonData to build our table and run queries in CarbonContext. We have run into some performance problems while tuning the system.

*Background*:

*cluster*: 100 executors, 5 tasks/executor, 10 GB memory/executor

*data*: 60+ GB (per replica) in carbon data format, 600+ MB/file * 100 files, 300+ columns, 300+ million rows

*sql example:*

select A,
sum(a),
sum(b),
sum(c),
… (about 100 more aggregations like sum(column))
from Table1 LATERAL VIEW explode(split(Aarray, ‘*;*’)) ATable AS A
where A is not null and d > “ab:c-10” and d < “h:0f3s” and e!=10 and f=22 and g=33 and h=44 GROUP BY A

*target query time*: <10s

*current query time*: 15s ~ 25s

*scene:* OLAP system. Fewer than 100 queries every day. Concurrency is not high. The CPU is idle most of the time, so this service will run alongside other programs. The service will run for a long time. We cannot reserve a very large amount of memory for every executor.

*refine*: I have built index and dictionary on d, e, f, g, h and built dictionary on all the other aggregation columns (i.e. a, b, c, … 100+ columns), and made sure there is one segment for the total data. I have enabled speculation (quantile=0.5, interval=250, multiplier=1.2).

Time is mainly spent in the first stage, before shuffling. As 95% of the data is filtered out, the shuffle process takes little time. In the first stage, most tasks complete in less than 10s, but there are still nearly 50 tasks that take longer than 10s. The maximum task time in one query may be 12~16s.

*Problem:*

1. GC problem. Some tasks in the first stage spend 20%~30% of their time in GC, even after a lot of parameter tuning. We now use G1 GC on Java 8. GC time doubles if we use CMS. Most of the GC time is spent on young-generation GC. Almost half of the young generation's memory is copied to the old generation. It seems many objects live longer than one GC period and their space is not reused (as the concurrent GC will release it later). When we use a large Eden (>=1G for example), a single GC takes seconds. If we set a small Eden (256M for example), a single GC takes hundreds of milliseconds, but GCs are more frequent and the total is still seconds. Is there any way to reduce the GC time? (We don’t consider the first query and second query in this case.)

2. Performance refine problem. The number of rows left after filtering is not uniform. Some nodes may be heavily loaded and spend more time than other nodes. The time of one task is 4s ~ 16s. Is there any method to improve it?

3. The first and second queries take too long. I know the dictionary and some indexes need to be loaded the first time. But after I tried using the query below to preheat it, it still takes a lot of time. How can I preheat the query correctly?

select Aarray, a, b, c… from Table1 where Aarray is not null and d = “sss” and e !=22 and f = 33 and g = 44 and h = 55

4. Any other suggestions to reduce the query time?

Some suggestions:

The log from class QueryStatisticsRecorder gives me a good means of finding the bottleneck, but it is not enough. There are still some metrics I think would be very useful:

1. Filter ratio, i.e. not only result_size but also the origin size, so we could know how much data is filtered.

2. IO time. The scan_blocks_time is not enough. If it is high, we know something is wrong, but not what causes the problem. The real IO time for data is not provided. As there may be several files for one partition, knowing whether the slowness is caused by the datanode or by the executor itself would help us find the problem.

3. The TableBlockInfo for each task. I log it myself when debugging. It tells me how many blocklets are local. The Spark web monitor just gives a locality level, but maybe only one blocklet is local.

- Anning Luo
*HULU*
Email: anning@hulu.com lanan...@gmail.com
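The filter-ratio metric proposed in the message above is simple arithmetic, sketched here for concreteness. The names `FilterRatioSketch`, `filterRatio`, `originSize` and `resultSize` are hypothetical and are not fields of QueryStatisticsRecorder; the sketch only assumes that both row counts are available for a task.

```java
// Illustrative sketch only; not part of CarbonData's statistics classes.
final class FilterRatioSketch {

    // Fraction of scanned rows removed by the filter.
    // Returns 0.0 when nothing was scanned, so there is no division by zero.
    static double filterRatio(long originSize, long resultSize) {
        if (originSize < 0 || resultSize < 0 || resultSize > originSize) {
            throw new IllegalArgumentException(
                "expected 0 <= resultSize <= originSize, got " + resultSize + " and " + originSize);
        }
        if (originSize == 0) {
            return 0.0;
        }
        return (originSize - resultSize) / (double) originSize;
    }

    public static void main(String[] args) {
        // The message mentions that about 95% of the data is filtered out,
        // for example 100 rows scanned and 5 rows kept.
        System.out.println(filterRatio(100, 5));
    }
}
```

Logged per task next to result_size, a value like this would show whether the slow tasks are the ones where the filter removes less data.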
join mail list
As above
join mail list
[GitHub] incubator-carbondata pull request #263: [CARBONDATA-2] Data load integration...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-carbondata/pull/263
[GitHub] incubator-carbondata pull request #290: [CARBONDATA-371] Write unit test for...
Github user harmeetsingh0013 closed the pull request at: https://github.com/apache/incubator-carbondata/pull/290
Re: [VOTE] Apache CarbonData 0.2.0-incubating release
+1 -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/VOTE-Apache-CarbonData-0-2-0-incubating-release-tp2823p2836.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
[GitHub] incubator-carbondata pull request #310: [CARBONDATA-401][WIP] One Pass Load
GitHub user lion-x opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/310

    [CARBONDATA-401][WIP] One Pass Load

    # Why raise this PR?

    # How to do?
    - [ ] Trans option useOnePass in Load Statement into CarbonCSVBasedSeqGenStep.java
    - [ ]
    - [ ]
    - [ ]

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lion-x/incubator-carbondata onePass

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #310

commit 611b0135425691eeb8fbc19469485834d23b2008
Author: lion-x
Date: 2016-11-10T09:08:42Z

    transonepass