[GitHub] incubator-carbondata issue #626: [WIP]Fixed loading issues in TPC-DS data fo...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/626 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1019/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Resolved] (CARBONDATA-691) After Compaction records count are mismatched.
[ https://issues.apache.org/jira/browse/CARBONDATA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala resolved CARBONDATA-691.
Resolution: Fixed
Fix Version/s: 1.0.1-incubating
> After Compaction records count are mismatched.
>
> Key: CARBONDATA-691
> URL: https://issues.apache.org/jira/browse/CARBONDATA-691
> Project: CarbonData
> Issue Type: Bug
> Components: data-load, data-query, docs
> Affects Versions: 1.0.0-incubating
> Reporter: Babulal
> Assignee: sounak chakraborty
> Fix For: 1.0.1-incubating
>
> Attachments: createLoadcmd.txt, driverlog.txt
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Spark version: Spark 1.6.2 and Spark 2.1.
> After compaction the data shown is wrong.
> Create the table and load the same data 4 times (compaction threshold is 4,3); each load inserts 105 records, as attached in the file.
>
> +--------------------+------------+--------------------------+--------------------------+
> | SegmentSequenceId  | Status     | Load Start Time          | Load End Time            |
> +--------------------+------------+--------------------------+--------------------------+
> | 3                  | Compacted  | 2017-02-01 14:07:51.922  | 2017-02-01 14:07:52.591  |
> | 2                  | Compacted  | 2017-02-01 14:07:33.481  | 2017-02-01 14:07:34.443  |
> | 1                  | Compacted  | 2017-02-01 14:07:23.495  | 2017-02-01 14:07:24.167  |
> | 0.1                | Success    | 2017-02-01 14:07:52.815  | 2017-02-01 14:07:57.201  |
> | 0                  | Compacted  | 2017-02-01 14:07:07.541  | 2017-02-01 14:07:11.983  |
> +--------------------+------------+--------------------------+--------------------------+
> 5 rows selected (0.021 seconds)
> 0: jdbc:hive2://8.99.61.4:23040> select count(*) from Comp_VMALL_DICTIONARY_INCLUDE_7;
> +-----------+
> | count(1)  |
> +-----------+
> | 1680      |
> +-----------+
> 1 row selected (4.468 seconds)
> 0: jdbc:hive2://8.99.61.4:23040> select count(imei) from Comp_VMALL_DICTIONARY_INCLUDE_7;
> +--------------+
> | count(imei)  |
> +--------------+
> | 1680         |
> +--------------+
> Expected: the total record count should be 420.
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
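The reported scenario can be condensed into a minimal reproduction sketch. This is hedged: the table name, column list, and CSV path below are illustrative placeholders rather than the attached createLoadcmd.txt, and the 4,3 threshold is assumed to be configured via the `carbon.compaction.level.threshold` property in carbon.properties.

```sql
-- Illustrative reproduction of CARBONDATA-691 (names and path are placeholders)
CREATE TABLE comp_test (imei STRING, age INT) STORED BY 'carbondata';

-- Load the same 105-record CSV four times; with a minor compaction
-- threshold of 4,3 the fourth load compacts segments 0-3 into 0.1
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE comp_test;
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE comp_test;
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE comp_test;
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE comp_test;

SHOW SEGMENTS FOR TABLE comp_test;
-- Expected 420 records (4 x 105); before the fix this returned 1680
SELECT COUNT(*) FROM comp_test;
```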
[GitHub] incubator-carbondata pull request #604: [CARBONDATA-691] After Compaction re...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-carbondata/pull/604 ---
[GitHub] incubator-carbondata issue #604: [CARBONDATA-691] After Compaction records c...
Github user ravipesala commented on the issue: https://github.com/apache/incubator-carbondata/pull/604 LGTM ---
[GitHub] incubator-carbondata pull request #626: [WIP]Fixed loading issues in TPC-DS ...
GitHub user ravipesala opened a pull request: https://github.com/apache/incubator-carbondata/pull/626 [WIP]Fixed loading issues in TPC-DS data for V3 format
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ravipesala/incubator-carbondata dictionary-server-issue
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/626.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #626
commit 187d5e8f61f9a401e6e21a64db8cc68326c50287
Author: ravipesala
Date: 2017-03-07T06:37:48Z
Fixed loading issues in TPC-DS data
---
[jira] [Assigned] (CARBONDATA-750) Improve exception information description while user input wrong creation table script
[ https://issues.apache.org/jira/browse/CARBONDATA-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anubhav tarar reassigned CARBONDATA-750: Assignee: anubhav tarar
> Improve exception information description while user input wrong creation table script
>
> Key: CARBONDATA-750
> URL: https://issues.apache.org/jira/browse/CARBONDATA-750
> Project: CarbonData
> Issue Type: Improvement
> Components: sql
> Reporter: Liang Chen
> Assignee: anubhav tarar
> Priority: Minor
>
> 1. Use a wrong create-table script:
> scala> carbon.sql("CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY 'carbondata'")
> java.lang.RuntimeException: [1.1] failure: identifier matching regex (?i)ALTER expected
> CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY 'carbondata'
> ^
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.parser.CarbonSpark2SqlParser.parse(CarbonSpark2SqlParser.scala:45)
> at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:51)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
> 2. The exception message should be improved to describe the actual problem, e.g.: unexpected "," found
> CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY
> ^
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
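For comparison, a corrected version of the same script parses cleanly: the column list needs `name type` pairs rather than commas between name and type (columns taken from the report; the statement itself is illustrative):

```sql
-- Corrected DDL: each column is "name type"; columns are comma-separated
CREATE TABLE carbontable1 (id INT, age STRING, year INT) STORED BY 'carbondata';
```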
[GitHub] incubator-carbondata issue #618: [CARBONDATA-734] Support the syntax of 'STO...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/618 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1018/ ---
[GitHub] incubator-carbondata issue #618: [CARBONDATA-734] Support the syntax of 'STO...
Github user watermen commented on the issue: https://github.com/apache/incubator-carbondata/pull/618 @ravipesala Please review the testcase. ---
[GitHub] incubator-carbondata pull request #625: [CARBONDATA-743] Remove redundant Ca...
Github user lionelcao commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/625#discussion_r104575573
--- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/CarbonFilters.scala ---
@@ -1,397 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.carbondata.spark
-
-import scala.collection.mutable.ArrayBuffer
-
-import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.optimizer.AttributeReferenceWrapper
-import org.apache.spark.sql.sources
-import org.apache.spark.sql.types.StructType
-
-import org.apache.carbondata.core.metadata.datatype.DataType
-import org.apache.carbondata.core.metadata.schema.table.CarbonTable
-import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn
-import org.apache.carbondata.core.scan.expression.{ColumnExpression => CarbonColumnExpression, Expression => CarbonExpression, LiteralExpression => CarbonLiteralExpression}
-import org.apache.carbondata.core.scan.expression.conditional._
-import org.apache.carbondata.core.scan.expression.logical.{AndExpression, FalseExpression, OrExpression}
-import org.apache.carbondata.spark.util.CarbonScalaUtil
-
-/**
- * All filter conversions are done here.
- */
-object CarbonFilters {
-
-  /**
-   * Converts data sources filters to carbon filter predicates.
-   */
-  def createCarbonFilter(schema: StructType,
-      predicate: sources.Filter): Option[CarbonExpression] = {
-    val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap
-
-    def createFilter(predicate: sources.Filter): Option[CarbonExpression] = {
-      predicate match {
-
-        case sources.EqualTo(name, value) =>
-          Some(new EqualToExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-        case sources.Not(sources.EqualTo(name, value)) =>
-          Some(new NotEqualsExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-
-        case sources.EqualNullSafe(name, value) =>
-          Some(new EqualToExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-        case sources.Not(sources.EqualNullSafe(name, value)) =>
-          Some(new NotEqualsExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-
-        case sources.GreaterThan(name, value) =>
-          Some(new GreaterThanExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-        case sources.LessThan(name, value) =>
-          Some(new LessThanExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-        case sources.GreaterThanOrEqual(name, value) =>
-          Some(new GreaterThanEqualToExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-        case sources.LessThanOrEqual(name, value) =>
-          Some(new LessThanEqualToExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, value)))
-
-        case sources.In(name, values) =>
-          Some(new InExpression(getCarbonExpression(name),
-            new ListExpression(
-              convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList))))
-        case sources.Not(sources.In(name, values)) =>
-          Some(new NotInExpression(getCarbonExpression(name),
-            new ListExpression(
-              convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList))))
-
-        case sources.IsNull(name) =>
-          Some(new EqualToExpression(getCarbonExpression(name),
-            getCarbonLiteralExpression(name, null), true))
-        case sources.IsNotNull(name) =>
-          Some(new NotEqualsExpression(getCarbonExpression(name),
[GitHub] incubator-carbondata pull request #625: [CARBONDATA-743] Remove redundant Ca...
Github user lionelcao commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/625#discussion_r104575582
--- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/CarbonFilters.scala --- @@ -1,397 +0,0 @@ (same CarbonFilters.scala diff excerpt as quoted in the previous comment)
[GitHub] incubator-carbondata pull request #625: [CARBONDATA-743] Remove redundant Ca...
Github user lionelcao commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/625#discussion_r104578046
--- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/CarbonFilters.scala --- @@ -1,397 +0,0 @@ (same CarbonFilters.scala diff excerpt as quoted in the first comment above)
[GitHub] incubator-carbondata pull request #614: [CARBONDATA-714]Documented how to ha...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/614#discussion_r104550978 --- Diff: docs/faq.md --- @@ -18,30 +18,57 @@ --> # FAQs -* **Auto Compaction not Working** -The Property carbon.enable.auto.load.merge in carbon.properties need to be set to true. +* [What are Bad Records?](#what-are-bad-records) +* [Where are Bad Records Stored in CarbonData?](#where-are-bad-records-stored-in-carbondata) +* [How to handle Bad Records?](#how-to-handle-bad-records) +* [How to resolve store location canât be found?](#how-to-resolve-store-location-can-not-be-found) +* [What is Carbon Lock Type?](#what-is-carbon-lock-type) +* [How to resolve Abstract Method Error?](#how-to-resolve-abstract-method-error) -* **Getting Abstract method error** +## What are Bad Records? +Records that fail to get loaded into the CarbonData due to data type incompatibility or are empty or have incompatible format are classified as Bad Records. -You need to specify the spark version while using Maven to build project. +## Where are Bad Records Stored in CarbonData? +The bad records are stored at the location set in carbon.badRecords.location in carbon.properties file. +By default **carbon.badRecords.location** specifies the following location ``/opt/Carbon/Spark/badrecords``. -* **Getting NotImplementedException for subquery using IN and EXISTS** +## How to handle Bad Records? +While loading data we can specify the approach to handle Bad Records. In order to analyse the cause of the Bad Records the parameter ``BAD_RECORDS_LOGGER_ENABLE`` must be set to value ``TRUE``. There are three approaches to handle Bad Records which can be specified by the parameter ``BAD_RECORDS_ACTION``. -Subquery with in and exists not supported in CarbonData. - -* **Getting Exceptions on creating a view** - -View not supported in CarbonData. 
- -* **How to verify if ColumnGroups have been created as desired.** +- To pad the incorrect values of the csv rows with NULL value and load the data in CarbonData, set the following in the query : +``` +'BAD_RECORDS_ACTION'='FORCE' +``` --- End diff -- Please add "How to ignore the bad records" ? Please find the detail discussion at here : http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-td7554.html --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #614: [CARBONDATA-714]Documented how to ha...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/614#discussion_r104550787
--- Diff: docs/faq.md --- @@ -18,30 +18,57 @@ (same faq.md diff excerpt as quoted in the preceding comment, ending at the "## How to handle Bad Records?" heading)
--- End diff --
This is "how to enable bad record logging".
---
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104566012
--- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.examples
+
+import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
+
+import org.apache.carbondata.core.util.CarbonProperties
+
+// scalastyle:off println
+object CompareTest {
+
+  val parquetTableName = "comparetest_parquet"
+  val carbonTableName = "comparetest_carbon"
+
+  private def generateDataFrame(spark: SparkSession): DataFrame = {
+    import spark.implicits._
+    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
+      .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
+        * 5))
+      .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
--- End diff --
ok, I found decimal is not supported for dataframe.write, I will raise a JIRA
---
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104565728
--- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ (same CompareTest.scala diff excerpt as in the previous comment, ending at the `.map(...)` line)
--- End diff --
ok
---
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104564975 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} + +import org.apache.carbondata.core.util.CarbonProperties + +// scalastyle:off println +object CompareTest { + + val parquetTableName = "comparetest_parquet" + val carbonTableName = "comparetest_carbon" + + private def generateDataFrame(spark: SparkSession): DataFrame = { +import spark.implicits._ +spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4) +.map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x +* 5)) +.toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8") + } + + private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit { +input.write.mode(SaveMode.Overwrite).parquet(parquetTableName) + } + + private def loadCarbonTable(spark: SparkSession, input: DataFrame): Long = { +spark.sql(s"drop table if exists $carbonTableName") +timeit { + input.write + .format("carbondata") + .option("tableName", carbonTableName) + .option("tempCSV", "false") + .option("single_pass", "true") + .option("dictionary_exclude", "id") // id is high cardinality column + .mode(SaveMode.Overwrite) + .save() +} + } + + private def prepareTable(spark: SparkSession): Unit = { +val df = generateDataFrame(spark).cache() +println(s"loading dataframe into table, schema: ${df.schema}") +val loadParquetTime = loadParquetTable(spark, df) +val loadCarbonTime = loadCarbonTable(spark, df) +println(s"load completed, time: $loadParquetTime, $loadCarbonTime") + spark.read.parquet(parquetTableName).registerTempTable(parquetTableName) + } + + private def runQuery(spark: SparkSession): Unit = { +val test = Array( + "select count(*) from $table", + "select sum(c4) from $table", + "select sum(c4), sum(c5) from $table", + "select sum(c4), sum(c5), sum(c6) from $table", + "select sum(c4), sum(c5), sum(c6), sum(c7) from $table", + "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table", + "select * from $table 
where id = 'i999' ", + "select * from $table where country = 'p9' ", + "select * from $table where city = 'j99' ", + "select * from $table where c4 < 1000 " --- End diff -- added --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
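The quoted benchmark measures load time with a `timeit { ... }` helper whose definition is not included in this diff hunk. As a rough illustration only (the name, signature, and millisecond unit here are assumptions, not the PR's actual implementation), such a helper can be sketched in plain JVM code:

```java
// Hypothetical sketch of a timing helper like the timeit used in CompareTest:
// run a block of work and return the elapsed wall-clock time in milliseconds.
public class TimeitSketch {
    static long timeit(Runnable block) {
        long start = System.nanoTime();
        block.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeit(() -> {
            try { Thread.sleep(10); } catch (InterruptedException e) { }
        });
        System.out.println("load completed, time: " + elapsed + " ms");
    }
}
```

Using `System.nanoTime()` rather than `currentTimeMillis()` is the usual choice for interval measurement, since it is monotonic and unaffected by wall-clock adjustments.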
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104564727 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} +import org.apache.spark.sql.types._ + +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.carbondata.core.util.CarbonProperties + +/** + * A query test case + * @param sqlText SQL statement + * @param queryType type of query: scan, filter, aggregate, topN + * @param desc description of the goal of this test case + */ +case class Query(sqlText: String, queryType: String, desc: String) + +// scalastyle:off println +object CompareTest { + + def parquetTableName: String = "comparetest_parquet" + def carbonTableName(version: String): String = s"comparetest_carbonV$version" + + // Table schema: + // +-+---+-+-++ + // | Column name | Data type | Cardinality | Column type | Dictionary | + // +-+---+-+-++ + // | id | string| 10,000,000 | dimension | no | + // +-+---+-+-++ + // | country | string| 1103| dimension | yes| + // +-+---+-+-++ + // | city| string| 13 | dimension | yes| + // +-+---+-+-++ + // | c4 | short | NA | measure | no | + // +-+---+-+-++ + // | c5 | int | NA | measure | no | + // +-+---+-+-++ + // | c6 | big int | NA | measure | no | + // +-+---+-+-++ + // | c7 | double| NA | measure | no | + // +-+---+-+-++ + // | c8 | double| NA | measure | no | + // +-+---+-+-++ + private def generateDataFrame(spark: SparkSession): DataFrame = { +val rdd = spark.sparkContext +.parallelize(1 to 10 * 1000 * 1000, 4) +.map { x => + (x.toString, "p" + x % 1103, "j" + x % 13, (x % 31).toShort, x, x.toLong * 1000, + x.toDouble / 13, x.toDouble / 71 ) +}.map { x => + Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8) +} +val schema = StructType( + Seq( +StructField("id", StringType, nullable = false), +StructField("country", StringType, nullable = false), +StructField("city", StringType, nullable = false), +StructField("c4", ShortType, nullable = true), +StructField("c5", IntegerType, nullable = true), +StructField("c6", 
LongType, nullable = true), +StructField("c7", DoubleType, nullable = true), +StructField("c8", DoubleType, nullable = true) + ) +) +spark.createDataFrame(rdd, schema) + } + + // performance test queries + val queries: Array[Query] = Array( +Query( + "select count(*) from $table", + "warm up", + "warm up query" +), +// === +// == FULL SCAN == +//
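The query templates in the `queries` array embed a `$table` placeholder rather than a concrete table name, so the harness must substitute the real table name (parquet or carbon) before each run. A minimal stand-alone sketch of that substitution (class and method names are illustrative, not taken from the PR):

```java
// Illustrative sketch: bind the $table placeholder used by the benchmark's
// query templates to a concrete table name before executing the SQL.
public class QueryTemplate {
    static String bind(String sqlText, String tableName) {
        return sqlText.replace("$table", tableName);
    }

    public static void main(String[] args) {
        System.out.println(bind("select count(*) from $table", "comparetest_parquet"));
    }
}
```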
[jira] [Created] (CARBONDATA-750) Improve exception information description while user input wrong creation table script
Liang Chen created CARBONDATA-750: - Summary: Improve exception information description while user input wrong creation table script Key: CARBONDATA-750 URL: https://issues.apache.org/jira/browse/CARBONDATA-750 Project: CarbonData Issue Type: Improvement Components: sql Reporter: Liang Chen Priority: Minor 1. Use wrong creation table script: scala> carbon.sql("CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY 'carbondata'") java.lang.RuntimeException: [1.1] failure: identifier matching regex (?i)ALTER expected CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY 'carbondata' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parser.CarbonSpark2SqlParser.parse(CarbonSpark2SqlParser.scala:45) at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:51) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) 2.Need improve the exception information description, like : unexpected "," found CREATE TABLE carbontable1 (id,int,age string,year,int) STORED BY ^ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
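For reference, the failing script separates each column name from its type with a comma, which the parser cannot interpret. Based on the valid script quoted in the related report CARBONDATA-749, a corrected form of the same statement would separate name and type with a space and the columns themselves with commas:

```sql
-- Corrected version of the reported script: "id,int" becomes "id int", etc.
CREATE TABLE carbontable1 (id int, age string, year int)
STORED BY 'carbondata'
```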
[jira] [Created] (CARBONDATA-749) Unexpected error log message while dropping carbon table
Liang Chen created CARBONDATA-749: - Summary: Unexpected error log message while dropping carbon table Key: CARBONDATA-749 URL: https://issues.apache.org/jira/browse/CARBONDATA-749 Project: CarbonData Issue Type: Bug Components: sql Affects Versions: 1.0.0-incubating Reporter: Liang Chen Priority: Minor 1.Create a table with the below script: carbon.sql("CREATE TABLE carbontable1 (id int, age string, year string) STORED BY 'carbondata'") 2.Drop table "carbontable1" with the below script: carbon.sql("drop table carbontable1") Unexpected error log message as below: AUDIT 07-03 07:50:11,944 - [AppledeMacBook-Pro.local][apple][Thread-1]Deleting table [carbontable1] under database [default] AUDIT 07-03 07:50:12,086 - [AppledeMacBook-Pro.local][apple][Thread-1]Creating Table with Database name [default] and Table name [carbontable1] AUDIT 07-03 07:50:12,095 - [AppledeMacBook-Pro.local][apple][Thread-1]Table creation with Database name [default] and Table name [carbontable1] failed. Table [carbontable1] already exists under database [default] WARN 07-03 07:50:12,095 - org.spark_project.guava.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Table [carbontable1] already exists under database [default] org.spark_project.guava.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Table [carbontable1] already exists under database [default] at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263) at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4880) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:110) at 
org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:578) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:574) at org.apache.spark.sql.execution.command.DropTableCommand.run(ddl.scala:203) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.Dataset.(Dataset.scala:185) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at org.apache.spark.sql.hive.CarbonHiveMetadataUtil$.invalidateAndDropTable(CarbonHiveMetadataUtil.scala:44) at org.apache.spark.sql.hive.CarbonMetastore.dropTable(CarbonMetastore.scala:435) at org.apache.spark.sql.execution.command.CarbonDropTableCommand.run(carbonTableSchema.scala:665) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at
[GitHub] incubator-carbondata pull request #614: [CARBONDATA-714]Documented how to ha...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/614#discussion_r104548399 --- Diff: docs/faq.md --- @@ -18,30 +18,57 @@ --> # FAQs -* **Auto Compaction not Working** -The Property carbon.enable.auto.load.merge in carbon.properties need to be set to true. +* [What are Bad Records?](#what-are-bad-records) +* [Where are Bad Records Stored in CarbonData?](#where-are-bad-records-stored-in-carbondata) +* [How to handle Bad Records?](#how-to-handle-bad-records) +* [How to resolve store location can't be found?](#how-to-resolve-store-location-can-not-be-found) +* [What is Carbon Lock Type?](#what-is-carbon-lock-type) +* [How to resolve Abstract Method Error?](#how-to-resolve-abstract-method-error) -* **Getting Abstract method error** +## What are Bad Records? +Records that fail to get loaded into the CarbonData due to data type incompatibility or are empty or have incompatible format are classified as Bad Records. -You need to specify the spark version while using Maven to build project. +## Where are Bad Records Stored in CarbonData? +The bad records are stored at the location set in carbon.badRecords.location in carbon.properties file. +By default **carbon.badRecords.location** specifies the following location ``/opt/Carbon/Spark/badrecords``. -* **Getting NotImplementedException for subquery using IN and EXISTS** +## How to handle Bad Records? +While loading data we can specify the approach to handle Bad Records. In order to analyse the cause of the Bad Records the parameter ``BAD_RECORDS_LOGGER_ENABLE`` must be set to value ``TRUE``. There are three approaches to handle Bad Records which can be specified by the parameter ``BAD_RECORDS_ACTION``. -Subquery with in and exists not supported in CarbonData. - -* **Getting Exceptions on creating a view** - -View not supported in CarbonData. 
- -* **How to verify if ColumnGroups have been created as desired.** +- To pad the incorrect values of the csv rows with NULL value and load the data in CarbonData, set the following in the query : +``` +'BAD_RECORDS_ACTION'='FORCE' +``` + +- To write the Bad Records without padding incorrect values with NULL in the raw csv (set in the parameter **carbon.badRecords.location**), set the following in the query : +``` +'BAD_RECORDS_ACTION'='REDIRECT' +``` + +- To ignore the Bad Records from getting stored in the raw csv, we need to set the following in the query : +``` +'BAD_RECORDS_ACTION'='INDIRECT' +``` + +## How to resolve store location can not be found? --- End diff -- Seems the title should be : How to specify storelocation while creating carbonsession
[GitHub] incubator-carbondata pull request #614: [CARBONDATA-714]Documented how to ha...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/614#discussion_r104547671 --- Diff: docs/faq.md --- @@ -18,30 +18,57 @@ --> # FAQs -* **Auto Compaction not Working** -The Property carbon.enable.auto.load.merge in carbon.properties need to be set to true. +* [What are Bad Records?](#what-are-bad-records) +* [Where are Bad Records Stored in CarbonData?](#where-are-bad-records-stored-in-carbondata) +* [How to handle Bad Records?](#how-to-handle-bad-records) +* [How to resolve store location can't be found?](#how-to-resolve-store-location-can-not-be-found) +* [What is Carbon Lock Type?](#what-is-carbon-lock-type) +* [How to resolve Abstract Method Error?](#how-to-resolve-abstract-method-error) -* **Getting Abstract method error** +## What are Bad Records? +Records that fail to get loaded into the CarbonData due to data type incompatibility or are empty or have incompatible format are classified as Bad Records. -You need to specify the spark version while using Maven to build project. +## Where are Bad Records Stored in CarbonData? +The bad records are stored at the location set in carbon.badRecords.location in carbon.properties file. +By default **carbon.badRecords.location** specifies the following location ``/opt/Carbon/Spark/badrecords``. -* **Getting NotImplementedException for subquery using IN and EXISTS** +## How to handle Bad Records? +While loading data we can specify the approach to handle Bad Records. In order to analyse the cause of the Bad Records the parameter ``BAD_RECORDS_LOGGER_ENABLE`` must be set to value ``TRUE``. There are three approaches to handle Bad Records which can be specified by the parameter ``BAD_RECORDS_ACTION``. -Subquery with in and exists not supported in CarbonData. - -* **Getting Exceptions on creating a view** - -View not supported in CarbonData. 
- -* **How to verify if ColumnGroups have been created as desired.** +- To pad the incorrect values of the csv rows with NULL value and load the data in CarbonData, set the following in the query : +``` +'BAD_RECORDS_ACTION'='FORCE' +``` + +- To write the Bad Records without padding incorrect values with NULL in the raw csv (set in the parameter **carbon.badRecords.location**), set the following in the query : +``` +'BAD_RECORDS_ACTION'='REDIRECT' +``` + +- To ignore the Bad Records from getting stored in the raw csv, we need to set the following in the query : +``` +'BAD_RECORDS_ACTION'='INDIRECT' +``` + +## How to resolve store location can not be found? +The store location specified while creating carbon session is used by the CarbonData to store the meta data like the schema, dictionary files, dictionary meta data and sort indexes. + +Try creating ``carbonsession`` with ``storepath`` specified in the following manner : +``` +val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession() +``` +Example: +``` +val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store ") +``` + +## What is Carbon Lock Type? --- End diff -- For users, which scenario need to set this parameter for lock?
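Putting the pieces of the quoted FAQ text together, a load command combining these options could look like the following (the table name and input path are illustrative placeholders, not taken from the document):

```sql
-- Hypothetical example: enable bad-record logging and redirect bad records
-- to the raw CSV at carbon.badRecords.location instead of loading them.
LOAD DATA INPATH 'hdfs://localhost:9000/data/sample.csv'
INTO TABLE my_table
OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='TRUE', 'BAD_RECORDS_ACTION'='REDIRECT')
```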
[GitHub] incubator-carbondata pull request #622: [CARBONDATA-744] The property "spark...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/622#discussion_r104545841 --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala --- @@ -116,8 +116,8 @@ class CarbonScanRDD( i += 1 result.add(partition) } - } else if (sparkContext.getConf.contains("spark.carbon.custom.distribution") && - sparkContext.getConf.getBoolean("spark.carbon.custom.distribution", false)) { + } else if (java.lang.Boolean + .parseBoolean(CarbonProperties.getInstance().getProperty("carbon.custom.distribution"))) { --- End diff -- The PR's title mentions that the property is "spark.carbon.custom.distribution", but here you change the property name to "carbon.custom.distribution", why ?
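One behavioral note on the new form: a property lookup can return `null` when the key is unset, and `java.lang.Boolean.parseBoolean(null)` returns `false` rather than throwing, so an unset flag simply leaves the branch untaken. A small stand-alone check of that null tolerance (using `java.util.Properties` as a stand-in for `CarbonProperties`, which is an assumption about its behavior for missing keys):

```java
import java.util.Properties;

public class PropertyFlag {
    // getProperty returns null for a missing key, and Boolean.parseBoolean(null)
    // returns false rather than throwing a NullPointerException, so an unset
    // flag is treated as disabled.
    static boolean isEnabled(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("carbon.custom.distribution", "true");
        System.out.println(isEnabled(props, "carbon.custom.distribution")); // true
        System.out.println(isEnabled(props, "some.missing.key"));           // false
    }
}
```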
[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/624 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1017/
[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/624 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1016/
[GitHub] incubator-carbondata pull request #625: [CARBONDATA-743] Remove redundant Ca...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/625#discussion_r104424923 --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/CarbonFilters.scala --- @@ -1,397 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.carbondata.spark - -import scala.collection.mutable.ArrayBuffer - -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.optimizer.AttributeReferenceWrapper -import org.apache.spark.sql.sources -import org.apache.spark.sql.types.StructType - -import org.apache.carbondata.core.metadata.datatype.DataType -import org.apache.carbondata.core.metadata.schema.table.CarbonTable -import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn -import org.apache.carbondata.core.scan.expression.{ColumnExpression => CarbonColumnExpression, Expression => CarbonExpression, LiteralExpression => CarbonLiteralExpression} -import org.apache.carbondata.core.scan.expression.conditional._ -import org.apache.carbondata.core.scan.expression.logical.{AndExpression, FalseExpression, OrExpression} -import org.apache.carbondata.spark.util.CarbonScalaUtil - -/** - * All filter conversions are done here. - */ -object CarbonFilters { - - /** - * Converts data sources filters to carbon filter predicates. 
- */ - def createCarbonFilter(schema: StructType, - predicate: sources.Filter): Option[CarbonExpression] = { -val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap - -def createFilter(predicate: sources.Filter): Option[CarbonExpression] = { - predicate match { - -case sources.EqualTo(name, value) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.Not(sources.EqualTo(name, value)) => - Some(new NotEqualsExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.EqualNullSafe(name, value) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.Not(sources.EqualNullSafe(name, value)) => - Some(new NotEqualsExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.GreaterThan(name, value) => - Some(new GreaterThanExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.LessThan(name, value) => - Some(new LessThanExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.GreaterThanOrEqual(name, value) => - Some(new GreaterThanEqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.LessThanOrEqual(name, value) => - Some(new LessThanEqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.In(name, values) => - Some(new InExpression(getCarbonExpression(name), -new ListExpression( - convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList -case sources.Not(sources.In(name, values)) => - Some(new NotInExpression(getCarbonExpression(name), -new ListExpression( - convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList - -case sources.IsNull(name) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, null), true)) -case 
sources.IsNotNull(name) => - Some(new NotEqualsExpression(getCarbonExpression(name), -
[GitHub] incubator-carbondata pull request #625: [CARBONDATA-743] Remove redundant Ca...
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/625#discussion_r104424040 --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/CarbonFilters.scala --- @@ -1,397 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.carbondata.spark - -import scala.collection.mutable.ArrayBuffer - -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.optimizer.AttributeReferenceWrapper -import org.apache.spark.sql.sources -import org.apache.spark.sql.types.StructType - -import org.apache.carbondata.core.metadata.datatype.DataType -import org.apache.carbondata.core.metadata.schema.table.CarbonTable -import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn -import org.apache.carbondata.core.scan.expression.{ColumnExpression => CarbonColumnExpression, Expression => CarbonExpression, LiteralExpression => CarbonLiteralExpression} -import org.apache.carbondata.core.scan.expression.conditional._ -import org.apache.carbondata.core.scan.expression.logical.{AndExpression, FalseExpression, OrExpression} -import org.apache.carbondata.spark.util.CarbonScalaUtil - -/** - * All filter conversions are done here. - */ -object CarbonFilters { - - /** - * Converts data sources filters to carbon filter predicates. 
- */ - def createCarbonFilter(schema: StructType, - predicate: sources.Filter): Option[CarbonExpression] = { -val dataTypeOf = schema.map(f => f.name -> f.dataType).toMap - -def createFilter(predicate: sources.Filter): Option[CarbonExpression] = { - predicate match { - -case sources.EqualTo(name, value) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.Not(sources.EqualTo(name, value)) => - Some(new NotEqualsExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.EqualNullSafe(name, value) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.Not(sources.EqualNullSafe(name, value)) => - Some(new NotEqualsExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.GreaterThan(name, value) => - Some(new GreaterThanExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.LessThan(name, value) => - Some(new LessThanExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.GreaterThanOrEqual(name, value) => - Some(new GreaterThanEqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) -case sources.LessThanOrEqual(name, value) => - Some(new LessThanEqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, value))) - -case sources.In(name, values) => - Some(new InExpression(getCarbonExpression(name), -new ListExpression( - convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList -case sources.Not(sources.In(name, values)) => - Some(new NotInExpression(getCarbonExpression(name), -new ListExpression( - convertToJavaList(values.map(f => getCarbonLiteralExpression(name, f)).toList - -case sources.IsNull(name) => - Some(new EqualToExpression(getCarbonExpression(name), -getCarbonLiteralExpression(name, null), true)) -case 
sources.IsNotNull(name) => - Some(new NotEqualsExpression(getCarbonExpression(name), -
[jira] [Closed] (CARBONDATA-657) We are not able to create table with shared dictionary columns in spark 2.1
[ https://issues.apache.org/jira/browse/CARBONDATA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Payal closed CARBONDATA-657. Resolution: Invalid > We are not able to create table with shared dictionary columns in spark 2.1 > --- > > Key: CARBONDATA-657 > URL: https://issues.apache.org/jira/browse/CARBONDATA-657 > Project: CarbonData > Issue Type: Bug > Components: sql >Affects Versions: 1.0.0-incubating > Environment: Spark-2.1 >Reporter: Payal >Assignee: anubhav tarar >Priority: Minor > > We are not able to create table with shared dictionary columns not working > with spark-2.1 but it is working fine with spark 1.6 > spark 1.6 logs > 0: jdbc:hive2://localhost:1> CREATE TABLE uniq_shared_dictionary (CUST_ID > int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ > timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 > decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, > Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY > 'org.apache.carbondata.format' > TBLPROPERTIES('DICTIONARY_INCLUDE'='CUST_ID,Double_COLUMN2,DECIMAL_COLUMN2','columnproperties.CUST_ID.shared_column'='shared.CUST_ID','columnproperties.decimal_column2.shared_column'='shared.decimal_column2'); > +-+--+ > | Result | > +-+--+ > +-+--+ > in spark 2.1 logs --- > 0: jdbc:hive2://hadoop-master:1> CREATE TABLE uniq_shared_dictionary > (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ > timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 > decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, > Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY > 'org.apache.carbondata.format' > TBLPROPERTIES('DICTIONARY_INCLUDE'='CUST_ID,Double_COLUMN2,DECIMAL_COLUMN2','columnproperties.CUST_ID.shared_column'='shared.CUST_ID','columnproperties.decimal_column2.shared_column'='shared.decimal_column2'); > Error: org.apache.carbondata.spark.exception.MalformedCarbonCommandException: 
> Invalid table properties columnproperties.cust_id.shared_column > (state=,code=0) > LOGS > ERROR 18-01 13:31:18,147 - Error executing query, currentState RUNNING, > org.apache.carbondata.spark.exception.MalformedCarbonCommandException: > Invalid table properties columnproperties.cust_id.shared_column > at > org.apache.carbondata.spark.util.CommonUtil$$anonfun$validateTblProperties$1.apply(CommonUtil.scala:141) > at > org.apache.carbondata.spark.util.CommonUtil$$anonfun$validateTblProperties$1.apply(CommonUtil.scala:137) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.carbondata.spark.util.CommonUtil$.validateTblProperties(CommonUtil.scala:137) > at > org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:135) > at > org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:60) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:503) > at > org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) > at > 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82) > at > org.apache.spark.sql.parser.CarbonSparkSqlParser.parse(CarbonSparkSqlParser.scala:45) > but if we give column name in lower case in spark 2.1 it works fine > spark 2.1 > CREATE TABLE uniq_shared_dictionary (cust_id int,CUST_NAME > String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, > BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), > decimal_column2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 > double,INTEGER_COLUMN1 int) STORED BY 'org.apache.carbondata.format' >
[GitHub] incubator-carbondata issue #616: [CARBONDATA-708] Fixed Between Filter Issue...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/616 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1015/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742]Added batch sort to ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/620#discussion_r104385171 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/sort/impl/UnsafeBatchParallelReadMergeSorterImpl.java --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.processing.newflow.sort.impl; + +import java.util.Iterator; +import java.util.List; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.Callable; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.LinkedBlockingQueue; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.atomic.AtomicLong; + +import org.apache.carbondata.common.CarbonIterator; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.util.CarbonProperties; +import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory; +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; +import org.apache.carbondata.processing.newflow.row.CarbonRow; +import org.apache.carbondata.processing.newflow.row.CarbonRowBatch; +import org.apache.carbondata.processing.newflow.row.CarbonSortBatch; +import org.apache.carbondata.processing.newflow.sort.Sorter; +import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage; +import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows; +import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeIntermediateMerger; +import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeSingleThreadFinalSortFilesMerger; +import org.apache.carbondata.processing.sortandgroupby.exception.CarbonSortKeyAndGroupByException; +import org.apache.carbondata.processing.sortandgroupby.sortdata.SortParameters; +import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException; + +/** + * It parallely reads data from array of iterates and do merge sort. + * It sorts data in batches and send to the next step. 
+ */ +public class UnsafeBatchParallelReadMergeSorterImpl implements Sorter { --- End diff -- Yes, we do sort in-memory. It sorts the data chunk by chunk (default size 64 MB) and keeps the chunks in memory; once the batch memory limit is reached, it starts a merge sort and hands the sorted data to the data writer. This approach is faster than sorting the whole big batch of records at once. ---
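The chunk-by-chunk sort followed by a final merge described above can be sketched in plain Java as follows. This is a simplified illustration only, not the actual UnsafeBatchParallelReadMergeSorterImpl (which sorts unsafe row pages in parallel); the int array and chunk size here are stand-ins for the ~64 MB in-memory row pages:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sort data chunk by chunk, then merge-sort the sorted chunks. */
public class ChunkMergeSortSketch {

    /** Splits input into chunks, sorts each chunk, then k-way merges them. */
    static int[] chunkSort(int[] data, int chunkSize) {
        // 1. Sort each chunk independently (in CarbonData this happens as rows
        //    accumulate into in-memory pages of roughly 64 MB each).
        List<int[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            int[] chunk = Arrays.copyOfRange(data, start, end);
            Arrays.sort(chunk);
            chunks.add(chunk);
        }
        // 2. Streaming k-way merge: a min-heap holds one cursor per chunk,
        //    each cursor being {chunkIndex, offsetWithinChunk}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt(c -> chunks.get(c[0])[c[1]]));
        for (int i = 0; i < chunks.size(); i++) {
            heap.add(new int[]{i, 0});
        }
        int[] out = new int[data.length];
        int pos = 0;
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            int[] chunk = chunks.get(cursor[0]);
            out[pos++] = chunk[cursor[1]];
            if (cursor[1] + 1 < chunk.length) {
                heap.add(new int[]{cursor[0], cursor[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0};
        System.out.println(Arrays.toString(chunkSort(data, 4)));
    }
}
```

The appeal of this design is that each small chunk sort is cheap and can overlap with reading input, and the final pass is a streaming merge of already-sorted runs rather than a full re-sort of the entire batch.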
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jarray888 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104374666 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} + +import org.apache.carbondata.core.util.CarbonProperties + +// scalastyle:off println +object CompareTest { + + val parquetTableName = "comparetest_parquet" + val carbonTableName = "comparetest_carbon" + + private def generateDataFrame(spark: SparkSession): DataFrame = { +import spark.implicits._ +spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4) +.map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x +* 5)) +.toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8") --- End diff -- can you add a column using decimal data type? ---
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jarray888 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104370904 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} + +import org.apache.carbondata.core.util.CarbonProperties + +// scalastyle:off println +object CompareTest { + + val parquetTableName = "comparetest_parquet" + val carbonTableName = "comparetest_carbon" + + private def generateDataFrame(spark: SparkSession): DataFrame = { +import spark.implicits._ +spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4) +.map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x --- End diff -- To simulate real-life data, please make the data unsorted, like `map(x => ("i" + random number, "p" + x % 13, "j" + x % 97, ...)` ---
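The reviewer's suggestion, deriving the id from a random number instead of the sequence index so that rows arrive unsorted while the country/city columns keep their modular cardinalities, can be illustrated outside Spark with a small generator. This is a hypothetical sketch; the class name, seed, and value ranges are illustrative, not part of the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch: generate synthetic rows whose id is random, so the data is unsorted. */
public class UnsortedDataSketch {

    /** Returns `count` rows of {id, country, city}; a fixed seed keeps runs reproducible. */
    static List<String[]> generateRows(int count, long seed) {
        Random random = new Random(seed);
        List<String[]> rows = new ArrayList<>(count);
        for (int x = 1; x <= count; x++) {
            rows.add(new String[]{
                "i" + random.nextInt(1_000_000),  // random id -> rows are not in key order
                "p" + x % 13,                     // low-cardinality "country" column
                "j" + x % 97                      // medium-cardinality "city" column
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : generateRows(5, 42L)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Sorted, monotonically increasing keys flatter any format that benefits from clustering, so randomizing the high-cardinality column makes the load and filter comparison more representative of real ingestion.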
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jarray888 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104369883 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} + +import org.apache.carbondata.core.util.CarbonProperties + +// scalastyle:off println +object CompareTest { + + val parquetTableName = "comparetest_parquet" + val carbonTableName = "comparetest_carbon" + + private def generateDataFrame(spark: SparkSession): DataFrame = { +import spark.implicits._ +spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4) +.map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x +* 5)) +.toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8") + } + + private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit { +input.write.mode(SaveMode.Overwrite).parquet(parquetTableName) --- End diff -- Suggest using the last char of the id column to partition the Parquet data, so the comparison is fair. ---
[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...
Github user jarray888 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/624#discussion_r104369679 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.examples + +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} + +import org.apache.carbondata.core.util.CarbonProperties + +// scalastyle:off println +object CompareTest { + + val parquetTableName = "comparetest_parquet" + val carbonTableName = "comparetest_carbon" + + private def generateDataFrame(spark: SparkSession): DataFrame = { +import spark.implicits._ +spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4) +.map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x +* 5)) +.toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8") + } + + private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit { +input.write.mode(SaveMode.Overwrite).parquet(parquetTableName) + } + + private def loadCarbonTable(spark: SparkSession, input: DataFrame): Long = { +spark.sql(s"drop table if exists $carbonTableName") +timeit { + input.write + .format("carbondata") + .option("tableName", carbonTableName) + .option("tempCSV", "false") + .option("single_pass", "true") + .option("dictionary_exclude", "id") // id is high cardinality column + .mode(SaveMode.Overwrite) + .save() +} + } + + private def prepareTable(spark: SparkSession): Unit = { +val df = generateDataFrame(spark).cache() +println(s"loading dataframe into table, schema: ${df.schema}") +val loadParquetTime = loadParquetTable(spark, df) +val loadCarbonTime = loadCarbonTable(spark, df) +println(s"load completed, time: $loadParquetTime, $loadCarbonTime") + spark.read.parquet(parquetTableName).registerTempTable(parquetTableName) + } + + private def runQuery(spark: SparkSession): Unit = { +val test = Array( + "select count(*) from $table", + "select sum(c4) from $table", + "select sum(c4), sum(c5) from $table", + "select sum(c4), sum(c5), sum(c6) from $table", + "select sum(c4), sum(c5), sum(c6), sum(c7) from $table", + "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table", + "select * from $table 
where id = 'i999' ", + "select * from $table where country = 'p9' ", + "select * from $table where city = 'j99' ", + "select * from $table where c4 < 1000 " --- End diff -- please add more test cases, for example: "select sum(c4) from $table where id like 'i1%' " "select sum(c4) from $table where id like '%10' " "select sum(c4) from $table where id like '%xyz%' " ---
[GitHub] incubator-carbondata issue #625: [CARBONDATA-743] Remove redundant CarbonFil...
Github user CarbonDataQA commented on the issue: https://github.com/apache/incubator-carbondata/pull/625 Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1014/ ---