Re:[DISCUSSION] Order by + Limit Optimization
Please ignore the email. My mistake, the mail was not finished; I will send a new mail later.

At 2017-03-28 13:24:34, "马云" wrote:
> [the quoted unfinished draft is snipped here; its text is identical to the original message below]
[DISCUSSION] Order by + Limit Optimization
Hi Dev,

Currently I have done an optimization for ORDER BY on one dimension. The performance test is below.

My optimization solution for ORDER BY on one dimension mainly leverages the fact that each dimension is stored in sorted order within each blocklet:

Step 1. Change the logical plan and push down the order-by and limit information to the carbon scan, and change the Sort physical plan to TakeOrderedAndProject, since data will be fetched and sorted in each partition.
Step 2. Apply the limit number and each blocklet's min/max index to filter blocklets. This can reduce scan time when some blocklets are filtered out.
Step 3. In each partition, load the order-by dimension data for all blocklets that passed the filter.
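One way the min/max pruning of step 2 can work is sketched below. This is an illustrative, self-contained sketch under assumed names (Blocklet and prune are not CarbonData's actual classes): for an ascending ORDER BY with LIMIT n, sort blocklets by their max value to find a conservative cutoff guaranteed to cover at least n rows, then skip every blocklet whose min exceeds that cutoff.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BlockletPruner {
    public static class Blocklet {
        public final String min, max; // min/max index values of the order-by dimension
        public final int rowCount;
        public Blocklet(String min, String max, int rowCount) {
            this.min = min;
            this.max = max;
            this.rowCount = rowCount;
        }
    }

    // For "ORDER BY dim ASC LIMIT n": find a conservative cutoff value such
    // that at least n rows are guaranteed to be <= cutoff, then skip every
    // blocklet whose min exceeds the cutoff (it cannot hold any top-n row).
    public static List<Blocklet> prune(List<Blocklet> all, int limit) {
        List<Blocklet> byMax = new ArrayList<>(all);
        byMax.sort(Comparator.comparing(b -> b.max));
        String cutoff = null;
        int guaranteed = 0; // rows known to have value <= cutoff
        for (Blocklet b : byMax) {
            guaranteed += b.rowCount;
            cutoff = b.max;
            if (guaranteed >= limit) break;
        }
        if (guaranteed < limit) return new ArrayList<>(all); // cannot prune anything
        List<Blocklet> kept = new ArrayList<>();
        for (Blocklet b : all) {
            if (b.min.compareTo(cutoff) <= 0) kept.add(b); // may hold top-n rows
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Blocklet> blocklets = new ArrayList<>();
        blocklets.add(new Blocklet("a", "c", 100));
        blocklets.add(new Blocklet("d", "f", 100));
        blocklets.add(new Blocklet("g", "i", 100));
        // 100 rows are guaranteed <= "c", so a LIMIT 50 query only needs the first blocklet.
        System.out.println(prune(blocklets, 50).size());
    }
}
```

The cutoff rule is conservative: a blocklet is skipped only when enough rows in other blocklets are provably smaller, so correctness does not depend on how rows are distributed inside each blocklet.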
Re: carbondata find a bug
Hi Tianli,

First, please send a mail to dev-subscr...@carbondata.incubator.apache.org to join the mailing list group. Then you can send and receive mail from dev@carbondata.incubator.apache.org.

Can you raise one JIRA at https://issues.apache.org/jira/browse/CARBONDATA, and raise one pull request to fix it?

Regards
Liang

2017-03-28 9:41 GMT+05:30 Tian Li 田力:
> hi:
> In org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD I
> found a bug at code lines 378-380:
>
> if (model.isFirstLoad && model.highCardIdentifyEnable
>     && !model.isComplexes(split.index)
>     && model.dimensions(split.index).isColumnar) {
>
> model.dimensions(split.index).isColumnar must be changed to
> model.primDimensions(split.index).isColumnar, because
> model.isComplexes.length may be != model.dimensions.length when the table
> is created with DICTIONARY_EXCLUDE.
>
> --
> 田力 Tian Li
> Senior Engineer
> 3/F Zone D, Dongsheng Technology Park Building B-2,
> No.66 Xixiaokou Rd, Haidian Dist., Beijing, China
>
> Disclaimer: The information in this email and any attachments may contain
> proprietary and confidential information that is intended for the
> addressee(s) only. If you are not the intended recipient, you are hereby
> notified that any disclosure, copying, distribution, retention or use of
> the contents of this information is prohibited. When addressed to our
> clients or vendors, any information contained in this e-mail or any
> attachments is subject to the terms and conditions in any governing
> contract.
pyspark carbondata
Is it possible to use Python to query CarbonData through Spark/Spark SQL?
Re: carbondata find a bug
+1 Best Regards David QiangCai -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-find-a-bug-tp9747p9749.html Sent from the Apache CarbonData Mailing List archive at Nabble.com.
Re: question about dimension's sort order in blocklet level
Hi

Can you provide one table to show your info? I can't see it very clearly. Columns with high cardinality (>100) would not be dictionary-encoded.

Regards
Liang

2017-03-27 14:32 GMT+05:30 马云:
> Hi DEV,
>
> I create a table according to the below SQL:
>
> cc.sql("""
>   CREATE TABLE IF NOT EXISTS t3
>   (ID Int,
>    date Timestamp,
>    country String,
>    name String,
>    phonetype String,
>    serialname String,
>    salary Int,
>    name1 String,
>    name2 String,
>    name3 String,
>    name4 String,
>    name5 String,
>    name6 String,
>    name7 String,
>    name8 String
>   )
>   STORED BY 'carbondata'
> """)
>
> Data cardinality is as below:
>
> column:      name  name1  name2  name3  name4  name5  name6  name7  name8
> cardinality: 1000  1000   1000   1000   1000   1000   1000   1000   1000
>
> After I loaded data into this table, I found that the dimension columns
> "name" and "name7" both have no dictionary encoding, but column "name" has
> no inverted index while column "name7" has an inverted index.
>
> Questions:
> 1. The dimension column "name" has dictionary encoding but no inverted
> index; is its data still ordered in the DataChunk2 blocklet?
> 2. Is there any document introducing these loading strategies?
> 3. If a dimension column has no dictionary encoding and no inverted index,
> and the user also didn't specify the column as no-inverted-index when
> creating the table, is its data still ordered in the DataChunk2 blocklet?
> 4. As I know, by default all dimension column data are sorted and stored
> in the DataChunk2 blocklet except for columns the user specifies as
> no-inverted-index, right?
> 5. As I know, the first dimension column of the MDK key is always sorted
> in the DataChunk2 blocklet; why not set isExplicitSorted to true?
> The attached is used to generate the data.csv.
> [quoted CreateData.java attachment snipped; it appears in full in the original message later in this digest]
carbondata find a bug
hi:

In org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD I found a bug at code lines 378-380:

if (model.isFirstLoad && model.highCardIdentifyEnable
    && !model.isComplexes(split.index)
    && model.dimensions(split.index).isColumnar) {

model.dimensions(split.index).isColumnar must be changed to model.primDimensions(split.index).isColumnar, because model.isComplexes.length may be != model.dimensions.length when the table is created with DICTIONARY_EXCLUDE.

--
田力 Tian Li
Senior Engineer
3/F Zone D, Dongsheng Technology Park Building B-2,
No.66 Xixiaokou Rd, Haidian Dist., Beijing, China
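The reported hazard can be illustrated with a minimal sketch (simplified names, not the actual CarbonData model classes): when the full dimension list and the primitive-only dimension list differ in length, indexing the wrong list with the same split index silently reads the wrong column.

```java
import java.util.Arrays;
import java.util.List;

public class IndexMismatch {
    // Full dimension list includes a complex column; the primitive-only view
    // of the same schema omits it, so the two lists can differ in length
    // (e.g. when DICTIONARY_EXCLUDE changes how dimensions are laid out).
    static final List<String> dimensions = Arrays.asList("country", "structCol", "name");
    static final List<String> primDimensions = Arrays.asList("country", "name");

    // Buggy lookup: indexes the full list with a split index that actually
    // refers to positions in primDimensions.
    static String buggyLookup(int splitIndex) {
        return dimensions.get(splitIndex);
    }

    // Fixed lookup: indexes the list the split index really refers to.
    static String fixedLookup(int splitIndex) {
        return primDimensions.get(splitIndex);
    }

    public static void main(String[] args) {
        int splitIndex = 1; // the split for primitive dimension "name"
        System.out.println(buggyLookup(splitIndex) + " vs " + fixedLookup(splitIndex));
    }
}
```

With splitIndex = 1, the buggy lookup returns "structCol" while the fixed lookup returns "name", which mirrors why the original code must index primDimensions rather than dimensions.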
Re: data not input hive
Now Spark persists data source tables into the Hive metastore in a Spark SQL specific format. This is not a bug.

-- Original --
From: "" <1141982...@qq.com>
Date: Mon, Mar 27, 2017 04:47 PM
To: "dev"
Subject: data not input hive

spark 2.1.0
hive 1.2.1
Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `default`.`carbon_table30` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Re: [apache/incubator-carbondata] [CARBONDATA-727][WIP] addhiveintegration for carbon (#672)
Hi, Anubhav:

Do you use MySQL to store the Hive metadata? Spark SQL and Hive must use the same metastore.

PS: Before you query data using Hive, you should alter the table schema. This is the latest guide:
https://github.com/cenyuhai/incubator-carbondata/blob/CARBONDATA-727/integration/hive/hive-guide.md

-- Original --
From: "Anubhav Tarar"
Date: Mon, Mar 27, 2017 02:59 PM
To: "dev"
Cc: "chenliang613"; "Mention"
Subject: Re: [apache/incubator-carbondata] [CARBONDATA-727][WIP] add hive integration for carbon (#672)

@sea hi, I tried to use Hive with the steps you mentioned in your PR but get a table-not-found exception from the Hive CLI. Here are the steps I used:

1. Start the spark shell with the Hive and carbon builds:

./spark-shell --jars /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar,/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar

2. Create the CarbonSession, then create and load tables:

scala> import org.apache.spark.sql.CarbonSession._
scala> import org.apache.spark.sql.SparkSession
scala> val carbon = SparkSession.builder().enableHiveSupport().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:54310/opt/carbonStore")
scala> carbon.sql("create table hive_carbon(id int, name string, scale decimal, country string, salary double) STORED BY 'carbondata'")
scala> carbon.sql("LOAD DATA INPATH 'hdfs://localhost:54310/sample.csv' INTO TABLE hive_carbon")

3. Start the Hive CLI and add the jars:

hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar;
Added [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar] to class path
hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar;
Added [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar] to class path
hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/jars/spark-catalyst_2.11-2.1.0.jar;
Added [/home/hduser/spark-2.1.0-bin-hadoop2.7/jars/spark-catalyst_2.11-2.1.0.jar] to class path

4. Query data using Hive:

hive> select * from hive_carbon;
FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'hive_carbon'

On Fri, Mar 24, 2017 at 9:30 AM, Sea <261810...@qq.com> wrote:
> I forgot something.
> Before querying data from Hive, we should set:
> set hive.mapred.supports.subdirectories=true;
> set mapreduce.input.fileinputformat.input.dir.recursive=true;
>
> -- Original --
> From: "261810726" <261810...@qq.com>
> Date: Thu, Mar 23, 2017 09:58 PM
> To: "chenliang613"; "dev"
> Cc: "Mention"
> Subject: Re: [apache/incubator-carbondata] [CARBONDATA-727][WIP] add
> hiveintegration for carbon (#672)
>
> Hi, liang:
> I created a new profile "integration/hive" and the CI is OK now. But I
> still have some problems in altering the Hive metastore schema.
> My steps are as follows:
>
> 1. Build carbondata:
>
> mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package -Phadoop-2.7.2 -Phive-1.2.1
>
> 2. Copy jars:
>
> mkdir ~/spark-2.1/carbon_lib
> cp ~/cenyuhai/incubator-carbondata/assembly/target/scala-2.11/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar ~/spark-2.1/carbon_lib/
> cp ~/cenyuhai/incubator-carbondata/integration/hive/target/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar ~/spark-2.1/carbon_lib/
>
> 3. Create sample.csv and put it into HDFS:
>
> id,name,scale,country,salary
> 1,yuhai,1.77,china,33000.0
> 2,runlin,1.70,china,32000.0
>
> 4. Create the table in Spark:
>
> spark-shell --jars "/data/hadoop/spark-2.1/carbon_lib/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar,/data/hadoop/spark-2.1/carbon_lib/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar"
>
> # execute these commands:
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> val rootPath = "hdfs:user/hadoop/carbon"
> val storeLocation = s"$rootPath/store"
> val warehouse = s"$rootPath/warehouse"
> val metastoredb = s"$rootPath/metastore_db"
>
> val carbon =
[jira] [Created] (CARBONDATA-827) Query statistics log format is incorrect
Jacky Li created CARBONDATA-827: --- Summary: Query statistics log format is incorrect Key: CARBONDATA-827 URL: https://issues.apache.org/jira/browse/CARBONDATA-827 Project: CarbonData Issue Type: Bug Reporter: Jacky Li The output log for query statistics has repeated numbers, which is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[DISCUSSION]: (New Feature) Streaming Ingestion into CarbonData
Hi All,

I would like to open a discussion for a new feature to support streaming ingestion in CarbonData. Please refer to the design document (draft) at the link below:
https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing

Your comments/suggestions are welcome. Here are some high-level points.

Rationale:
The current ways of adding user data to a CarbonData table are the LOAD statement or a SELECT query with an INSERT INTO statement. These methods add data in bulk to a CarbonData table as a new segment; basically, it is batch insertion of a bulk of data. However, with the increasing demand for real-time data analytics with streaming frameworks, CarbonData needs a way to insert streaming data continuously into a CarbonData table. CarbonData needs support for continuous and faster ingestion into a CarbonData table, making the data available for querying. CarbonData can leverage the newly introduced V3 format to append streaming data to an existing carbon table.

Requirements:
Following are some high-level requirements:
1. CarbonData shall create a new segment (streaming segment) for each streaming session. Concurrent streaming ingestion into the same table will create separate streaming segments.
2. CarbonData shall use a write-optimized format (instead of the multi-layered indexed columnar format) to support ingestion of streaming data into a CarbonData table.
3. CarbonData shall create the streaming segment folder and open a streaming data file in append mode to write data. CarbonData should avoid creating multiple small files by appending to an existing file.
4. The data stored in the new streaming segment shall be available for query after it is written to disk (hflush/hsync). In other words, CarbonData readers should be able to query the data written to the streaming segment so far.
5. CarbonData should acknowledge the write-operation status back to the output sink / upper-layer streaming engine, so that in case of a write failure the streaming engine can restart the operation and maintain exactly-once delivery semantics.
6. The CarbonData compaction process shall support compacting data from the write-optimized streaming segment to the regular read-optimized columnar CarbonData format.
7. CarbonData readers should maintain read consistency by means of a timestamp.
8. Maintain durability: in case of a write failure, CarbonData should be able to recover to the latest commit status. This may require maintaining source and destination offsets of the last commits in metadata.

This feature can be done in phases:
Phase 1: Add the basic framework and writer support to allow Spark Structured Streaming into CarbonData. This phase may or may not have append support. Add reader support to read streaming data files.
Phase 2: Add append support if not done in phase 1. Maintain append offsets and metadata information.
Phase 3: Add support for external streaming frameworks such as Kafka streaming using Spark Structured Streaming; maintain topics/partitions/offsets and support fault tolerance.
Phase 4: Add support for other streaming frameworks, such as Flink, Beam, etc.
Phase 5: Future support for an in-memory cache for buffering streaming data, support for union with Spark Structured Streaming to serve directly from Spark Structured Streaming, and support for time-series data.

Best Regards,
Aniket
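The commit-offset idea in requirement 8 can be sketched minimally as follows. This is an in-memory illustration under assumed names (StreamingCommitLog is hypothetical, not a CarbonData class); a real implementation would persist this metadata atomically on disk/HDFS. The offset is recorded only after the written data is durable (hflush/hsync), so after a crash the source is replayed from the last committed offset, and idempotent appends then give effectively exactly-once semantics.

```java
import java.util.HashMap;
import java.util.Map;

public class StreamingCommitLog {
    // Last durably committed source offset per streaming segment.
    private final Map<String, Long> committed = new HashMap<>();

    // Called only after hflush/hsync has succeeded for all rows up to `offset`.
    public void commit(String segment, long offset) {
        committed.put(segment, offset);
    }

    // On restart: resume reading the source from the last committed offset
    // (0 if the segment has never committed).
    public long resumeOffset(String segment) {
        return committed.getOrDefault(segment, 0L);
    }

    public static void main(String[] args) {
        StreamingCommitLog log = new StreamingCommitLog();
        log.commit("streaming_segment_0", 1000L);
        // Simulate a crash after writing up to offset 1500 *without* a commit:
        // the recovery point is still 1000, so rows 1000..1500 are replayed.
        System.out.println(log.resumeOffset("streaming_segment_0"));
    }
}
```

Keeping the recovery point strictly behind the durable data is what lets the engine restart a failed write without losing or double-counting rows, provided replayed appends are deduplicated or idempotent.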
Re: data not input hive
Hi,

Carbon does not support loading data using Hive yet. You can use Spark to load.

Regards,
Jacky

> On 2017-03-27, at 14:17, 风云际会 <1141982...@qq.com> wrote:
>
> spark 2.1.0
> hive 1.2.1
> Couldn't find corresponding Hive SerDe for data source provider
> org.apache.spark.sql.CarbonSource. Persisting data source table
> `default`.`carbon_table30` into Hive metastore in Spark SQL specific format,
> which is NOT compatible with Hive.
data not input hive
spark 2.1.0 hive 1.2.1 Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `default`.`carbon_table30` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
[jira] [Created] (CARBONDATA-825) upload or delete problem
sehriff created CARBONDATA-825: -- Summary: upload or delete problem Key: CARBONDATA-825 URL: https://issues.apache.org/jira/browse/CARBONDATA-825 Project: CarbonData Issue Type: Bug Reporter: sehriff Update/delete statements such as cc.sql("update/delete ...") do not take effect unless .show is appended, i.e. cc.sql("update/delete ...").show; the beeline client cannot update/delete successfully either in thriftserver mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CARBONDATA-826) Create carbondata-connector of presto for supporting presto query carbon data
Liang Chen created CARBONDATA-826: - Summary: Create carbondata-connector of presto for supporting presto query carbon data Key: CARBONDATA-826 URL: https://issues.apache.org/jira/browse/CARBONDATA-826 Project: CarbonData Issue Type: Sub-task Components: presto-integration Reporter: Liang Chen Assignee: Liang Chen Priority: Minor 1.In CarbonData project, generate carbondata-connector of presto 2.Copy carbondata-connector to presto/plugin/ 3.Run query in presto to read carbon data. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
Re: Re:Re:Re:Re:Re:Re: insert into carbon table failed
Hi

Please enable the vector reader; it might help the limit queries.

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Regards
Liang

a wrote
> TEST SQL:
>
> High-cardinality random query:
> select * From carbon_table where dt='2017-01-01' and user_id='' limit 100;
>
> High-cardinality random query with LIKE:
> select * From carbon_table where dt='2017-01-01' and fo like '%%' limit 100;
>
> Low-cardinality random query:
> select * From carbon_table where dt='2017-01-01' and plat='android' and tv='8400' limit 100
>
> 1-dimension query:
> select province,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province
>
> 2-dimension query:
> select province,city,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province,city
>
> 3-dimension query:
> select province,city,isp,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province,city,isp
>
> Multi-dimension query:
> select sty,isc,status,nw,tv,area,province,city,isp,sum(play_pv) play_pv_sum ,sum(spt_cnt) spt_cnt_sum
> from carbon_table where dt='2017-01-01' and sty=''
> group by sty,isc,status,nw,tv,area,province,city,isp
>
> DISTINCT on a single column:
> select tv, count(distinct user_id)
> from carbon_table where dt='2017-01-01' and sty='' and fo like '%%' group by tv
>
> DISTINCT on multiple columns:
> select count(distinct user_id) ,count(distinct mid),count(distinct case when sty='' then mid end)
> from carbon_table where dt='2017-01-01' and sty=''
>
> Sort query:
> select user_id,sum(play_pv) play_pv_sum
> from carbon_table
> group by user_id
> order by play_pv_sum desc limit 100
>
> Simple join query:
> select b.fo_level1,b.fo_level2,sum(a.play_pv) play_pv_sum From carbon_table a
> left join dim_carbon_table b
> on a.fo=b.fo and a.dt = b.dt where a.dt = '2017-01-01' group by b.fo_level1,b.fo_level2
>
> At 2017-03-27 04:10:04, "a" wwyxg@
wrote:
>> I downloaded the newest source code (master) and compiled it, generating the jar carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar.
>> Then I used Spark 2.1 to test again. The error logs are as follows:
>>
>> Container log:
>> 17/03/27 02:27:21 ERROR newflow.DataLoadExecutor: Executor task launch worker-9 Data Loading failed for table carbon_table
>> java.lang.NullPointerException
>>   at org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.createConfiguration(DataLoadProcessBuilder.java:158)
>>   at org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.build(DataLoadProcessBuilder.java:60)
>>   at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:43)
>>   at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:365)
>>   at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.compute(NewCarbonDataLoadRDD.scala:322)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>> 17/03/27 02:27:21 INFO rdd.NewDataFrameLoaderRDD: DataLoad failure
>> org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException: Data Loading failed for table carbon_table
>>   at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:54)
>>   at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:365)
>>   at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.compute(NewCarbonDataLoadRDD.scala:322)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.NullPointerException
>>   at org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.createConfiguration(DataLoadProcessBuilder.java:158)
>>   at
Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release
+1

Regards
Manish Gupta

On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal wrote:
> +1
> -Regards
> Kumar Vishal
>
> On Mar 27, 2017 09:31, "xm_zzc" <441586...@qq.com> wrote:
>
> > Hi, Liang:
> > Thanks for your reply.
> >
> > --
> > View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> > Sent from the Apache CarbonData Mailing List archive at Nabble.com.
question about dimension's sort order in blocklet level
Hi DEV,

I create a table according to the below SQL:

cc.sql("""
  CREATE TABLE IF NOT EXISTS t3
  (ID Int,
   date Timestamp,
   country String,
   name String,
   phonetype String,
   serialname String,
   salary Int,
   name1 String,
   name2 String,
   name3 String,
   name4 String,
   name5 String,
   name6 String,
   name7 String,
   name8 String
  )
  STORED BY 'carbondata'
""")

Data cardinality is as below:

column:      name  name1  name2  name3  name4  name5  name6  name7  name8
cardinality: 1000  1000   1000   1000   1000   1000   1000   1000   1000

After I loaded data into this table, I found that the dimension columns "name" and "name7" both have no dictionary encoding, but column "name" has no inverted index while column "name7" has an inverted index.

Questions:
1. The dimension column "name" has dictionary encoding but no inverted index; is its data still ordered in the DataChunk2 blocklet?
2. Is there any document introducing these loading strategies?
3. If a dimension column has no dictionary encoding and no inverted index, and the user also didn't specify the column as no-inverted-index when creating the table, is its data still ordered in the DataChunk2 blocklet?
4. As I know, by default all dimension column data are sorted and stored in the DataChunk2 blocklet except for columns the user specifies as no-inverted-index, right?
5. As I know, the first dimension column of the MDK key is always sorted in the DataChunk2 blocklet; why not set isExplicitSorted to true?
The attached is used to generate the data.csv:

package test;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CreateData {

    public static void main(String[] args) {
        FileOutputStream outSTr = null;
        BufferedOutputStream buff = null;
        int count = 1000; // number of lines to write
        try {
            outSTr = new FileOutputStream(new File("data.csv"));
            buff = new BufferedOutputStream(outSTr);
            long begin0 = System.currentTimeMillis();
            buff.write(
                "ID,date,country,name,phonetype,serialname,salary,name1,name2,name3,name4,name5,name6,name7,name8\n"
                    .getBytes());

            int idcount = count;
            int datecount = 30;
            int countrycount = 5;
            int phonetypecount = 1;
            int serialnamecount = 5;
            Map<Integer, String> countryMap = new HashMap<>();
            countryMap.put(1, "usa");
            countryMap.put(2, "uk");
            countryMap.put(3, "china");
            countryMap.put(4, "indian");
            countryMap.put(0, "canada");

            StringBuilder sb = null;
            for (int i = idcount; i >= 0; i--) {
                sb = new StringBuilder();
                sb.append(400 + i).append(","); // id
                sb.append("2015/8/" + (i % datecount + 1)).append(","); // date
                sb.append(countryMap.get(i % countrycount)).append(","); // country
                sb.append("name" + (160 - i)).append(","); // name
                sb.append("phone" + i % phonetypecount).append(","); // phonetype
                sb.append("serialname" + (10 + i % serialnamecount)).append(","); // serialname
                sb.append(i + 50).append(","); // salary
                sb.append("name1" + (i + 10)).append(","); // name1
                sb.append("name2" + (i + 20)).append(","); // name2
                sb.append("name3" + (i + 30)).append(","); // name3
                sb.append("name4" + (i + 40)).append(","); // name4
                sb.append("name5" + (i + 50)).append(","); // name5
                sb.append("name6" + (i + 60)).append(","); // name6
                sb.append("name7" + (i + 70)).append(","); // name7
                sb.append("name8" + (i + 80)).append('\n'); // name8 (no trailing comma)
                buff.write(sb.toString().getBytes());
            }

            buff.flush();
            System.out.println("sb.toString():" + sb.toString());
            long end0 = System.currentTimeMillis();
            System.out.println("BufferedOutputStream elapsed time: " + (end0 - begin0) + " ms");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (buff != null) buff.close();
                if (outSTr != null) outSTr.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release
+1
-Regards
Kumar Vishal

On Mar 27, 2017 09:31, "xm_zzc" <441586...@qq.com> wrote:
> Hi, Liang:
> Thanks for your reply.
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> Sent from the Apache CarbonData Mailing List archive at Nabble.com.
Re: insert into carbon table failed
I guess the word "node" in "CarbonData launches one job per each node to sort the data at node level and avoid shuffling" may cause some confusion. I guess CarbonData should launch one task per executor; "job" here should be "task", and "node" should be "executor". Maybe he can try increasing the number of executors to avoid the memory problem.
Carbondata resolve kettle dependencies fail
As the attachment shows, CarbonData fails to resolve the Kettle dependencies. Does anyone know how to fix this? Also, if you use an exclusion in Maven to exclude Kettle, your project will fail even at compile time.
Re: [jira] [Created] (CARBONDATA-824) Null pointer Exception display to user while performance Testing
First, can you try changing the create table statement to end with STORED BY 'carbondata' instead of STORED BY 'org.apache.carbondata.format'? Second, can you give some sample data instead of trying to upload a 32 GB CSV file?
[jira] [Created] (CARBONDATA-824) Null pointer Exception display to user while performance Testing
Vinod Rohilla created CARBONDATA-824: Summary: Null pointer Exception display to user while performance Testing Key: CARBONDATA-824 URL: https://issues.apache.org/jira/browse/CARBONDATA-824 Project: CarbonData Issue Type: Bug Components: data-query Affects Versions: 1.0.1-incubating Environment: SPARK 2.1 Reporter: Vinod Rohilla Displays null pointer exception to the user while select Query. Steps to reproduces: 1: Create table: CREATE TABLE oscon_new_1 (ACTIVE_AREA_ID String, ACTIVE_CHECK_DY String, ACTIVE_CHECK_HOUR String, ACTIVE_CHECK_MM String, ACTIVE_CHECK_TIME String, ACTIVE_CHECK_YR String, ACTIVE_CITY String, ACTIVE_COUNTRY String, ACTIVE_DISTRICT String, ACTIVE_EMUI_VERSION String, ACTIVE_FIRMWARE_VER String, ACTIVE_NETWORK String, ACTIVE_OS_VERSION String, ACTIVE_PROVINCE String, BOM String, CHECK_DATE String, CHECK_DY String, CHECK_HOUR String, CHECK_MM String, CHECK_YR String, CUST_ADDRESS_ID String, CUST_AGE String, CUST_BIRTH_COUNTRY String, CUST_BIRTH_DY String, CUST_BIRTH_MM String, CUST_BIRTH_YR String, CUST_BUY_POTENTIAL String, CUST_CITY String, CUST_STATE String, CUST_COUNTRY String, CUST_COUNTY String, CUST_EMAIL_ADDR String, CUST_LAST_RVW_DATE TIMESTAMP, CUST_FIRST_NAME String, CUST_ID String, CUST_JOB_TITLE String, CUST_LAST_NAME String, CUST_LOGIN String, CUST_NICK_NAME String, CUST_PRFRD_FLG String, CUST_SEX String, CUST_STREET_NAME String, CUST_STREET_NO String, CUST_SUITE_NO String, CUST_ZIP String, DELIVERY_CITY String, DELIVERY_STATE String, DELIVERY_COUNTRY String, DELIVERY_DISTRICT String, DELIVERY_PROVINCE String, DEVICE_NAME String, INSIDE_NAME String, ITM_BRAND String, ITM_BRAND_ID String, ITM_CATEGORY String, ITM_CATEGORY_ID String, ITM_CLASS String, ITM_CLASS_ID String, ITM_COLOR String, ITM_CONTAINER String, ITM_FORMULATION String, ITM_MANAGER_ID String, ITM_MANUFACT String, ITM_MANUFACT_ID String, ITM_ID String, ITM_NAME String, ITM_REC_END_DATE String, ITM_REC_START_DATE String, LATEST_AREAID String, LATEST_CHECK_DY String, 
LATEST_CHECK_HOUR String, LATEST_CHECK_MM String, LATEST_CHECK_TIME String, LATEST_CHECK_YR String, LATEST_CITY String, LATEST_COUNTRY String, LATEST_DISTRICT String, LATEST_EMUI_VERSION String, LATEST_FIRMWARE_VER String, LATEST_NETWORK String, LATEST_OS_VERSION String, LATEST_PROVINCE String, OL_ORDER_DATE String, OL_ORDER_NO INT, OL_RET_ORDER_NO String, OL_RET_DATE String, OL_SITE String, OL_SITE_DESC String, PACKING_DATE String, PACKING_DY String, PACKING_HOUR String, PACKING_LIST_NO String, PACKING_MM String, PACKING_YR String, PRMTION_ID String, PRMTION_NAME String, PRM_CHANNEL_CAT String, PRM_CHANNEL_DEMO String, PRM_CHANNEL_DETAILS String, PRM_CHANNEL_DMAIL String, PRM_CHANNEL_EMAIL String, PRM_CHANNEL_EVENT String, PRM_CHANNEL_PRESS String, PRM_CHANNEL_RADIO String, PRM_CHANNEL_TV String, PRM_DSCNT_ACTIVE String, PRM_END_DATE String, PRM_PURPOSE String, PRM_START_DATE String, PRODUCT_ID String, PROD_BAR_CODE String, PROD_BRAND_NAME String, PRODUCT_NAME String, PRODUCT_MODEL String, PROD_MODEL_ID String, PROD_COLOR String, PROD_SHELL_COLOR String, PROD_CPU_CLOCK String, PROD_IMAGE String, PROD_LIVE String, PROD_LOC String, PROD_LONG_DESC String, PROD_RAM String, PROD_ROM String, PROD_SERIES String, PROD_SHORT_DESC String, PROD_THUMB String, PROD_UNQ_DEVICE_ADDR String, PROD_UNQ_MDL_ID String, PROD_UPDATE_DATE String, PROD_UQ_UUID String, SHP_CARRIER String, SHP_CODE String, SHP_CONTRACT String, SHP_MODE_ID String, SHP_MODE String, STR_ORDER_DATE String, STR_ORDER_NO String, TRACKING_NO String, WH_CITY String, WH_COUNTRY String, WH_COUNTY String, WH_ID String, WH_NAME String, WH_STATE String, WH_STREET_NAME String, WH_STREET_NO String, WH_STREET_TYPE String, WH_SUITE_NO String, WH_ZIP String, CUST_DEP_COUNT DOUBLE, CUST_VEHICLE_COUNT DOUBLE, CUST_ADDRESS_CNT DOUBLE, CUST_CRNT_CDEMO_CNT DOUBLE, CUST_CRNT_HDEMO_CNT DOUBLE, CUST_CRNT_ADDR_DM DOUBLE, CUST_FIRST_SHIPTO_CNT DOUBLE, CUST_FIRST_SALES_CNT DOUBLE, CUST_GMT_OFFSET DOUBLE, CUST_DEMO_CNT DOUBLE, 
CUST_INCOME DOUBLE, PROD_UNLIMITED INT, PROD_OFF_PRICE DOUBLE, PROD_UNITS INT, TOTAL_PRD_COST DOUBLE, TOTAL_PRD_DISC DOUBLE, PROD_WEIGHT DOUBLE, REG_UNIT_PRICE DOUBLE, EXTENDED_AMT DOUBLE, UNIT_PRICE_DSCNT_PCT DOUBLE, DSCNT_AMT DOUBLE, PROD_STD_CST DOUBLE, TOTAL_TX_AMT DOUBLE, FREIGHT_CHRG DOUBLE, WAITING_PERIOD DOUBLE, DELIVERY_PERIOD DOUBLE, ITM_CRNT_PRICE DOUBLE, ITM_UNITS DOUBLE, ITM_WSLE_CST DOUBLE, ITM_SIZE DOUBLE, PRM_CST DOUBLE, PRM_RESPONSE_TARGET DOUBLE, PRM_ITM_DM DOUBLE, SHP_MODE_CNT DOUBLE, WH_GMT_OFFSET DOUBLE, WH_SQ_FT DOUBLE, STR_ORD_QTY DOUBLE, STR_WSLE_CST DOUBLE, STR_LIST_PRICE DOUBLE, STR_SALES_PRICE DOUBLE, STR_EXT_DSCNT_AMT DOUBLE, STR_EXT_SALES_PRICE DOUBLE, STR_EXT_WSLE_CST DOUBLE, STR_EXT_LIST_PRICE DOUBLE, STR_EXT_TX DOUBLE, STR_COUPON_AMT DOUBLE, STR_NET_PAID