Re: [DISCUSSION] Order by + Limit Optimization

2017-03-27 Thread 马云


Please ignore this email; my mistake, the mail was not finished. I will send a new mail later.


[DISCUSSION] Order by + Limit Optimization

2017-03-27 Thread 马云
Hi Dev,


Currently I have done an optimization for ORDER BY on one dimension;
the performance test is as below.

My optimization solution for ORDER BY on one dimension is as follows.
It mainly leverages the fact that each dimension's data is stored in sorted order within each blocklet.

Step 1. Change the logical plan to push the ORDER BY and LIMIT information down to the
carbon scan, and change the Sort physical plan to TakeOrderedAndProject, since the data
will be fetched and sorted within each partition (see the sketch after this message).

Step 2. Apply the limit number and the blocklet's min/max index to filter blocklets,
which reduces scan time when some blocklets are filtered out.

Step 3. In each partition, load the ORDER BY dimension data only for the blocklets
that pass the filter.
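To make the TakeOrderedAndProject change in step 1 concrete, here is a minimal, self-contained
Scala sketch (plain collections only, neither CarbonData nor Spark code) of the per-partition
top-K pattern it relies on: each partition contributes only its smallest k keys, so the final
merge never touches more than k * numPartitions rows. All names and values are made up for the
illustration.

// Toy illustration of per-partition top-K (the idea behind TakeOrderedAndProject).
object TopKSketch {
  def main(args: Array[String]): Unit = {
    val k = 3
    val partitions: Seq[Seq[String]] = Seq(
      Seq("name9", "name2", "name7"),   // rows held by partition 0
      Seq("name1", "name8", "name4"))   // rows held by partition 1

    // Each partition sorts locally and keeps only its first k keys.
    val perPartitionTopK = partitions.map(_.sorted.take(k))

    // The driver merges the small candidate sets and applies the limit once more.
    val result = perPartitionTopK.flatten.sorted.take(k)
    println(result)  // List(name1, name2, name4)
  }
}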





Re: carbondata find a bug

2017-03-27 Thread Liang Chen
Hi Tianli,

First, please send a mail to dev-subscr...@carbondata.incubator.apache.org to
join the mailing list group.

Then you can send and receive mail from dev@carbondata.incubator.apache.org.

Can you raise one JIRA at https://issues.apache.org/jira/browse/CARBONDATA,
and raise one pull request to fix it?


Regards

Liang

2017-03-28 9:41 GMT+05:30 Tian Li 田力 :

> hi:
>  org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD
> find a bug:
> code line 378-380
>
> if (model.isFirstLoad && model.highCardIdentifyEnable
> && !model.isComplexes(split.index)
> && model.dimensions(split.index).isColumnar) {
>
> model.dimensions(split.index).isColumnar must be changed to
> model.primDimensions(split.index).isColumnar, because
> model.isComplexes.length may be != model.dimensions.length when the
> table is created with DICTIONARY_EXCLUDE.
>
>
> --
> 田力  Tian Li
> 高级工程师  Senior Engineer
>
> Beijing (100192) | Shanghai | Guangzhou | Xi'an | Chengdu
> 3/F Zone D, Dongsheng Technology Park Building B-2,
> No.66 Xixiaokou Rd, Haidian Dist., Beijing, China
>
>
>


pyspark carbondata

2017-03-27 Thread ????????
Is it possible to use Python to query CarbonData through Spark/Spark SQL?

Re: carbondata find a bug

2017-03-27 Thread QiangCai
+1

Best Regards
David QiangCai






Re: question about dimension's sort order in blocklet level

2017-03-27 Thread Liang Chen
Hi,

Can you provide your info in one table? It is hard to read clearly as sent.

Columns with high cardinality (>100) are not dictionary-encoded.

Regards
Liang

2017-03-27 14:32 GMT+05:30 马云 :

> Hi DEV,
>
>  I create table according to the below SQL
>
> cc.sql("""
>   CREATE TABLE IF NOT EXISTS t3 (
>     ID Int,
>     date Timestamp,
>     country String,
>     name String,
>     phonetype String,
>     serialname String,
>     salary Int,
>     name1 String,
>     name2 String,
>     name3 String,
>     name4 String,
>     name5 String,
>     name6 String,
>     name7 String,
>     name8 String
>   )
>   STORED BY 'carbondata'
>   """)
>
>
>
> The data cardinality is as below:
>
> column:      name   name1  name2  name3  name4  name5  name6  name7  name8
> cardinality: 1000   1000   1000   1000   1000   1000   1000   1000   1000
>
>
>
> After I loaded data into this table, I found that the dimension columns "name" and
> "name7" both have no dictionary encoding,
> but column "name" has no inverted index while column "name7" has an inverted
> index.
>
> Questions:
>
> 1. The dimension column "name" is dictionary-encoded but has no inverted
> index; is its data still sorted in the DataChunk2 blocklet?
>
> 2. Is there any document that introduces these loading strategies?
>
> 3. If a dimension column is not dictionary-encoded and has no inverted
> index, and the user also did not mark the column as no-inverted-index when
> creating the table, is its data still sorted in the DataChunk2 blocklet?
>
> 4. As I understand, by default all dimension column data is sorted and stored
> in the DataChunk2 blocklet unless the user marks the column as no-inverted-index,
> right?
>
> 5. As I understand, the first dimension column of the MDK key is always sorted in
> the DataChunk2 blocklet, so why is isExplicitSorted not set to true?
>

carbondata find a bug

2017-03-27 Thread Tian Li 田力
Hi,

I found a bug in org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD,
code lines 378-380:

if (model.isFirstLoad && model.highCardIdentifyEnable
    && !model.isComplexes(split.index)
    && model.dimensions(split.index).isColumnar) {

model.dimensions(split.index).isColumnar must be changed to
model.primDimensions(split.index).isColumnar, because model.isComplexes.length
may be != model.dimensions.length when the table is created with DICTIONARY_EXCLUDE.
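The exact one-line change is already given above; the self-contained Scala snippet below is only
a toy model (not CarbonData code) of why the report matters: when DICTIONARY_EXCLUDE shrinks the
list of dictionary (primitive) dimensions, an index built against that shorter list can address
the wrong entry in the full dimensions list, so both lookups must go through the same list.

// Toy model of the index mismatch; Dim and the column lists are invented for illustration.
object IndexMismatchSketch {
  case class Dim(name: String, isColumnar: Boolean)

  def main(args: Array[String]): Unit = {
    val allDimensions  = Seq(Dim("name", true), Dim("name7", true), Dim("country", true))
    // Suppose "name" and "name7" are DICTIONARY_EXCLUDE'd, so only "country" needs a dictionary.
    val primDimensions = Seq(Dim("country", true))

    val splitIndex = 0  // a split is created per dictionary (primitive) dimension
    println(primDimensions(splitIndex).name)  // country -> the intended column
    println(allDimensions(splitIndex).name)   // name    -> a different column for the same index
  }
}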


--
田力  Tian Li
高级工程师  Senior Engineer

Beijing (100192) | Shanghai | Guangzhou | Xi'an | Chengdu
3/F Zone D, Dongsheng Technology Park Building B-2,
No.66 Xixiaokou Rd, Haidian Dist., Beijing, China




Re: data not input hive

2017-03-27 Thread Sea
Spark now persists data source tables into the Hive metastore in a Spark SQL specific
format. This is not a bug.




-- Original --
From:  "";<1141982...@qq.com>;
Date:  Mon, Mar 27, 2017 04:47 PM
To:  "dev"; 

Subject:  data not input hive



spark 2.1.0
 hive 1.2.1
 Couldn't find corresponding Hive SerDe for data source provider 
org.apache.spark.sql.CarbonSource. Persisting data source table 
`default`.`carbon_table30` into Hive metastore in Spark SQL specific format, 
which is NOT compatible with Hive.

Re: [apache/incubator-carbondata] [CARBONDATA-727][WIP] addhiveintegration for carbon (#672)

2017-03-27 Thread Sea
Hi, Anubhav:
Do you use MySQL to store the Hive metadata? Spark SQL and Hive must use the
same metastore.
PS: Before you query data using Hive, you should alter the table schema.


This is the latest guide.
https://github.com/cenyuhai/incubator-carbondata/blob/CARBONDATA-727/integration/hive/hive-guide.md
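For context, a minimal sketch of the "same metastore" point above, assuming a thrift metastore
(the URI and store path below are placeholders): the CarbonSession must be built with Hive support
and pick up the same hive-site.xml / metastore URI that the Hive CLI uses, otherwise tables created
in Spark land in a separate local metastore and Hive reports "Table not found".

// Sketch only; the metastore URI and store path are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // must match what the Hive CLI uses
  .getOrCreateCarbonSession("hdfs://localhost:54310/opt/carbonStore")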




-- Original --
From:  "Anubhav Tarar";;
Date:  Mon, Mar 27, 2017 02:59 PM
To:  "dev"; 
Cc:  "chenliang613"; 
"Mention"; 
Subject:  Re: [apache/incubator-carbondata] [CARBONDATA-727][WIP] 
addhiveintegration for carbon (#672)



@sea Hi, I tried to use Hive with the steps you mentioned in your PR, but I
get a "table not found" exception from the Hive CLI. Here are the steps I used:

1. Start the spark shell with the Hive and Carbon builds:

./spark-shell --jars /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata_2.11-1.incubating-SNAPSHOT-shade-hadoop2.7.2.
jar,/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/carbondata-hive-1.1.0-
incubating-SNAPSHOT.jar

2. Create the CarbonSession, then create and load the tables:

scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val carbon = SparkSession.builder().enableHiveSupport().config(sc.
getConf).getOrCreateCarbonSession("hdfs://localhost:54310/opt/carbonStore")

scala>carbon.sql("create table hive_carbon(id int, name string, scale
decimal, country string, salary double) STORED BY 'carbondata'")
scala>carbon.sql("LOAD DATA INPATH 'hdfs://localhost:54310/sample.csv' INTO
TABLE hive_carbon")

3. Start the Hive CLI and add the jars:

hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata-hive-1.1.0-incubating-SNAPSHOT.jar;
Added [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata-hive-1.1.0-incubating-SNAPSHOT.jar] to class path
Added resources: [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata-hive-1.1.0-incubating-SNAPSHOT.jar]

hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar;
Added [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar] to class
path
Added resources: [/home/hduser/spark-2.1.0-bin-hadoop2.7/carbonlib/
carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar]

hive> add jar /home/hduser/spark-2.1.0-bin-hadoop2.7/jars/spark-catalyst_
2.11-2.1.0.jar;
Added 
[/home/hduser/spark-2.1.0-bin-hadoop2.7/jars/spark-catalyst_2.11-2.1.0.jar]
to class path
Added resources: [/home/hduser/spark-2.1.0-bin-
hadoop2.7/jars/spark-catalyst_2.11-2.1.0.jar]


4. Query the data using Hive:

hive> select * from hive_carbon;
FAILED: SemanticException [Error 10001]: Line 1:14 Table not found
'hive_carbon'







On Fri, Mar 24, 2017 at 9:30 AM, Sea <261810...@qq.com> wrote:

> I forgot something.
> Before querying data from Hive, we should set:
> set hive.mapred.supports.subdirectories=true;
> set mapreduce.input.fileinputformat.input.dir.recursive=true;
>
>
> -- Original --
> From:  "261810726";<261810...@qq.com>;
> Date:  Thu, Mar 23, 2017 09:58 PM
> To:  "chenliang613"; "dev" incubator.apache.org>;
> Cc:  "Mention";
> Subject:  Re:  [apache/incubator-carbondata] [CARBONDATA-727][WIP] add
> hiveintegration for carbon (#672)
>
>
>
> Hi, Liang:
> I created a new profile "integration/hive" and the CI is OK now. But I
> still have some problems in altering the Hive metastore schema.
> My steps are as follows:
>
> 1.build carbondata
>
>
> mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package
> -Phadoop-2.7.2 -Phive-1.2.1
>
>
>
> 2.copy jars
>
>
> mkdir ~/spark-2.1/carbon_lib
> cp ~/cenyuhai/incubator-carbondata/assembly/target/
> scala-2.11/carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar
> ~/spark-2.1/carbon_lib/
> cp ~/cenyuhai/incubator-carbondata/integration/hive/
> target/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar
> ~/spark-2.1/carbon_lib/
>
>
>
> 3.create sample.csv and put it into hdfs
>
>
> id,name,scale,country,salary
> 1,yuhai,1.77,china,33000.0
> 2,runlin,1.70,china,32000.0
>
>
>
> 4.create table in spark
>
>
> spark-shell --jars "/data/hadoop/spark-2.1/carbon_lib/carbondata_2.11-1.
> 1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar,/data/hadoop/
> spark-2.1/carbon_lib/carbondata-hive-1.1.0-incubating-SNAPSHOT.jar"
>
>
> #execute these commands:
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> val rootPath = "hdfs:user/hadoop/carbon"
> val storeLocation = s"$rootPath/store"
> val warehouse = s"$rootPath/warehouse"
> val metastoredb = s"$rootPath/metastore_db"
>
>
> val carbon = 
> 

[jira] [Created] (CARBONDATA-827) Query statistics log format is incorrect

2017-03-27 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-827:
---

 Summary: Query statistics log format is incorrect
 Key: CARBONDATA-827
 URL: https://issues.apache.org/jira/browse/CARBONDATA-827
 Project: CarbonData
  Issue Type: Bug
Reporter: Jacky Li


The output log for query statistics has repeated numbers, which is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[DISCUSSION]: (New Feature) Streaming Ingestion into CarbonData

2017-03-27 Thread Aniket Adnaik
Hi All,

I would like to open up a discussion for new feature to support streaming
ingestion in CarbonData.

Please refer to design document(draft) in the link below.
  https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M
/view?usp=sharing

Your comments/suggestions are welcome.
Here are some high level points.

Rationale:
The current ways of adding user data to a CarbonData table are via the LOAD
statement or via a SELECT query with an INSERT INTO statement. These methods
add a bulk of data into a new segment of the CarbonData table; basically, it is
batch insertion of a bulk of data. However, with the increasing demand for
real-time data analytics with streaming frameworks, CarbonData needs a way
to insert streaming data continuously into a CarbonData table. CarbonData
needs support for continuous and faster ingestion into the CarbonData table,
making it available for querying.

CarbonData can leverage our newly introduced V3 format to append
streaming data to an existing carbon table.


Requirements:

Following are some high-level requirements:
1.  CarbonData shall create a new segment (Streaming Segment) for each
streaming session. Concurrent streaming ingestion into same table will
create separate streaming segments.

2.  CarbonData shall use write optimized format (instead of multi-layered
indexed columnar format) to support ingestion of streaming data into a
CarbonData table.

3.  CarbonData shall create streaming segment folder and open a streaming
data file in append mode to write data. CarbonData should avoid creating
multiple small files by appending to an existing file.

4.  The data stored in new streaming segment shall be available for query
after it is written to the disk (hflush/hsync). In other words, CarbonData
Readers should be able to query the data in streaming segment written so
far.

5.  CarbonData should acknowledge the write operation status back to output
sink/upper layer streaming engine so that in the case of write failure,
streaming engine should restart the operation and maintain exactly once
delivery semantics.

6.  CarbonData Compaction process shall support compacting data from
write-optimized streaming segment to regular read optimized columnar
CarbonData format.

7.  CarbonData readers should maintain the read consistency by means of
using timestamp.

8.  Maintain durability: in case of write failure, CarbonData should be
able to recover to the latest commit status. This may require maintaining the
source and destination offsets of the last commits in metadata.

This feature can be done in phases;

Phase-1: Add the basic framework and writer support to allow Spark Structured
Streaming into CarbonData (a hypothetical sketch of such a job follows this list).
This phase may or may not have append support. Add reader support to read
streaming data files.

Phase-2: Add append support if not done in Phase 1. Maintain append
offsets and metadata information.

Phase-3: Add support for external streaming frameworks such as Kafka
streaming using Spark Structured Streaming; maintain
topics/partitions/offsets and support fault tolerance.

Phase-4: Add support for other streaming frameworks, such as Flink, Beam,
etc.

Phase-5: Future support for an in-memory cache for buffering streaming data, and
support for union with Spark Structured Streaming so results can be served directly
from the stream. Also add support for time-series data.
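To make Phase-1 concrete, below is a purely hypothetical sketch of what such a streaming ingestion
job could look like from the user's side. The "carbondata" streaming sink and its options are the
feature being proposed here, not an existing API, and the source, paths, and option names are
placeholders.

// Hypothetical sketch only: "carbondata" as a writeStream format is the proposed feature.
import org.apache.spark.sql.SparkSession

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("carbon-streaming-sketch").getOrCreate()

    val input = spark.readStream
      .format("socket")                  // any streaming source; a socket source keeps the sketch short
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val query = input.writeStream
      .format("carbondata")              // hypothetical sink name (the proposed feature)
      .option("dbName", "default")       // illustrative option
      .option("tableName", "stream_table")
      .option("checkpointLocation", "/tmp/carbon_stream_ckpt")
      .start()

    query.awaitTermination()
  }
}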

Best Regards,
Aniket


Re: data not input hive

2017-03-27 Thread Jacky Li
Hi,

Carbon does not support loading data using Hive yet. You can use Spark to load.
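For reference, a minimal sketch of loading through Spark instead; the store path, HDFS path, table
name and schema below are placeholders, and the CarbonSession setup follows the pattern used
elsewhere in this digest.

// Sketch only; paths, table name and schema are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .getOrCreateCarbonSession("hdfs://localhost:54310/opt/carbonStore")
carbon.sql("CREATE TABLE IF NOT EXISTS carbon_table30 (id INT, name STRING) STORED BY 'carbondata'")
carbon.sql("LOAD DATA INPATH 'hdfs://localhost:54310/sample.csv' INTO TABLE carbon_table30")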

Regards,
Jacky

> On 27 Mar 2017, at 2:17 PM, 风云际会 <1141982...@qq.com> wrote:
> 
> spark 2.1.0
> hive 1.2.1
> Couldn't find corresponding Hive SerDe for data source provider 
> org.apache.spark.sql.CarbonSource. Persisting data source table 
> `default`.`carbon_table30` into Hive metastore in Spark SQL specific format, 
> which is NOT compatible with Hive.





data not input hive

2017-03-27 Thread 风云际会
spark 2.1.0
 hive 1.2.1
 Couldn't find corresponding Hive SerDe for data source provider 
org.apache.spark.sql.CarbonSource. Persisting data source table 
`default`.`carbon_table30` into Hive metastore in Spark SQL specific format, 
which is NOT compatible with Hive.

[jira] [Created] (CARBONDATA-825) upload or delete problem

2017-03-27 Thread sehriff (JIRA)
sehriff created CARBONDATA-825:
--

 Summary: upload or delete problem
 Key: CARBONDATA-825
 URL: https://issues.apache.org/jira/browse/CARBONDATA-825
 Project: CarbonData
  Issue Type: Bug
Reporter: sehriff


Update/delete statements like cc.sql("update/delete ...") do not take effect
unless .show is appended, e.g. cc.sql("update/delete ...").show.
The beeline client cannot update/delete successfully either, in thrift server mode.
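A minimal illustration of the reported behaviour (the table name and predicate are placeholders,
and cc is the Carbon SQL context/session used in the report):

// Illustration only; table name and predicate are placeholders.
cc.sql("delete from t3 where ID = 400")        // reportedly has no visible effect on its own
cc.sql("delete from t3 where ID = 400").show() // reportedly works once .show forces execution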



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-826) Create carbondata-connector of presto for supporting presto query carbon data

2017-03-27 Thread Liang Chen (JIRA)
Liang Chen created CARBONDATA-826:
-

 Summary: Create carbondata-connector of presto for supporting 
presto query carbon data
 Key: CARBONDATA-826
 URL: https://issues.apache.org/jira/browse/CARBONDATA-826
 Project: CarbonData
  Issue Type: Sub-task
  Components: presto-integration
Reporter: Liang Chen
Assignee: Liang Chen
Priority: Minor


1. In the CarbonData project, generate the carbondata-connector for Presto.
2. Copy the carbondata-connector to presto/plugin/.
3. Run a query in Presto to read carbon data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Re:Re:Re:Re:Re:Re: insert into carbon table failed

2017-03-27 Thread Liang Chen
Hi,

Please enable the vector reader; it might help the limit query:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Regards
Liang


a wrote
> TEST SQL :
> High-cardinality random query
> select * From carbon_table where dt='2017-01-01' and user_id='' limit
> 100;
> 
> 
> High-cardinality random query with LIKE
> select * From carbon_table where dt='2017-01-01' and fo like '%%'
> limit 100;
> 
> 
> Low-cardinality random query
> select * From carbon_table where dt='2017-01-01' and plat='android' and
> tv='8400' limit 100
> 
> 
> 1-dimension query
> select province,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province
> 
> 
> 2-dimension query
> select province,city,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province,city
> 
> 
> 3-dimension query
> select province,city,isp,sum(play_pv) play_pv ,sum(spt_cnt) spt_cnt
> from carbon_table where dt='2017-01-01' and sty=''
> group by province,city,isp
> 
> 
> Multi-dimension query
> select sty,isc,status,nw,tv,area,province,city,isp,sum(play_pv)
> play_pv_sum ,sum(spt_cnt) spt_cnt_sum
> from carbon_table where dt='2017-01-01' and sty=''
> group by sty,isc,status,nw,tv,area,province,city,isp
> 
> 
> DISTINCT on a single column
> select tv, count(distinct user_id) 
> from carbon_table where dt='2017-01-01' and sty='' and fo like
> '%%' group by tv
> 
> 
> DISTINCT on multiple columns
> select count(distinct user_id) ,count(distinct mid),count(distinct case
> when sty='' then mid end)
> from carbon_table where dt='2017-01-01' and sty=''
> 
> 
> Sort query
> select user_id,sum(play_pv) play_pv_sum
> from carbon_table
> group by user_id
> order by play_pv_sum desc limit 100
> 
> 
> Simple join query
> select b.fo_level1,b.fo_level2,sum(a.play_pv) play_pv_sum From
> carbon_table a
> left join dim_carbon_table b
> on a.fo=b.fo and a.dt = b.dt where a.dt = '2017-01-01' group by
> b.fo_level1,b.fo_level2
> 
> At 2017-03-27 04:10:04, "a" wwyxg@ wrote:
>>I downloaded the newest source code (master) and compiled it, generating the jar
>>carbondata_2.11-1.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar.
>>Then I used Spark 2.1 to test again. The error logs are as follows:
>>
>>
>> Container log :
>>17/03/27 02:27:21 ERROR newflow.DataLoadExecutor: Executor task launch
worker-9 Data Loading failed for table carbon_table
>>java.lang.NullPointerException
>>at
>> org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.createConfiguration(DataLoadProcessBuilder.java:158)
>>at
>> org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.build(DataLoadProcessBuilder.java:60)
>>at
>> org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:43)
>>at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:365)
>>at
>> org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.compute(NewCarbonDataLoadRDD.scala:322)
>>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>at java.lang.Thread.run(Thread.java:745)
>>17/03/27 02:27:21 INFO rdd.NewDataFrameLoaderRDD: DataLoad failure
>>org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException:
Data Loading failed for table carbon_table
>>at
>> org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:54)
>>at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:365)
>>at
>> org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.compute(NewCarbonDataLoadRDD.scala:322)
>>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>at java.lang.Thread.run(Thread.java:745)
>>Caused by: java.lang.NullPointerException
>>at
>> org.apache.carbondata.processing.newflow.DataLoadProcessBuilder.createConfiguration(DataLoadProcessBuilder.java:158)
>>at
>> 

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

2017-03-27 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal 
wrote:

> +1
> -Regards
> Kumar Vishal
>
> On Mar 27, 2017 09:31, "xm_zzc" <441586...@qq.com> wrote:
>
> > Hi, Liang:
> >   Thanks for your reply.
> >
> >
> >
> >
>


question about dimension's sort order in blocklet level

2017-03-27 Thread 马云
Hi DEV,

 I create table according to the below SQL 

cc.sql("""
  CREATE TABLE IF NOT EXISTS t3 (
    ID Int,
    date Timestamp,
    country String,
    name String,
    phonetype String,
    serialname String,
    salary Int,
    name1 String,
    name2 String,
    name3 String,
    name4 String,
    name5 String,
    name6 String,
    name7 String,
    name8 String
  )
  STORED BY 'carbondata'
  """)

 

The data cardinality is as below:

column:      name   name1  name2  name3  name4  name5  name6  name7  name8
cardinality: 1000   1000   1000   1000   1000   1000   1000   1000   1000

 

After I loaded data into this table, I found that the dimension columns "name" and
"name7" both have no dictionary encoding,
but column "name" has no inverted index while column "name7" has an inverted index.

Questions:

1. The dimension column "name" is dictionary-encoded but has no inverted
index; is its data still sorted in the DataChunk2 blocklet?

2. Is there any document that introduces these loading strategies?

3. If a dimension column is not dictionary-encoded and has no inverted index,
and the user also did not mark the column as no-inverted-index when creating the table,
is its data still sorted in the DataChunk2 blocklet?

4. As I understand, by default all dimension column data is sorted and stored in
the DataChunk2 blocklet unless the user marks the column as no-inverted-index, right?

5. As I understand, the first dimension column of the MDK key is always sorted in
the DataChunk2 blocklet, so why is isExplicitSorted not set to true?

 

 the attached is used to generate the data.csv

package test;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CreateData {

  public static void main(String[] args) {
    FileOutputStream outStr = null;
    BufferedOutputStream buff = null;
    int idcount = 1000; // number of data rows to write
    try {
      outStr = new FileOutputStream(new File("data.csv"));
      buff = new BufferedOutputStream(outStr);
      long begin0 = System.currentTimeMillis();

      // CSV header
      buff.write(
          "ID,date,country,name,phonetype,serialname,salary,name1,name2,name3,name4,name5,name6,name7,name8\n"
              .getBytes());

      int datecount = 30;
      int countrycount = 5;
      int phonetypecount = 1;
      int serialnamecount = 5;
      Map<Integer, String> countryMap = new HashMap<Integer, String>();
      countryMap.put(1, "usa");
      countryMap.put(2, "uk");
      countryMap.put(3, "china");
      countryMap.put(4, "indian");
      countryMap.put(0, "canada");

      StringBuilder sb = null;
      for (int i = idcount; i >= 0; i--) {
        sb = new StringBuilder();
        sb.append(400 + i).append(",");                                    // ID
        sb.append("2015/8/" + (i % datecount + 1)).append(",");            // date
        sb.append(countryMap.get(i % countrycount)).append(",");           // country
        sb.append("name" + (160 - i)).append(",");                         // name
        sb.append("phone" + i % phonetypecount).append(",");               // phonetype
        sb.append("serialname" + (10 + i % serialnamecount)).append(","); // serialname
        sb.append(i + 50).append(",");                                     // salary
        sb.append("name1" + (i + 10)).append(",");
        sb.append("name2" + (i + 20)).append(",");
        sb.append("name3" + (i + 30)).append(",");
        sb.append("name4" + (i + 40)).append(",");
        sb.append("name5" + (i + 50)).append(",");
        sb.append("name6" + (i + 60)).append(",");
        sb.append("name7" + (i + 70)).append(",");
        sb.append("name8" + (i + 80)).append(",").append('\n');
        buff.write(sb.toString().getBytes());
      }

      buff.flush();
      System.out.println("sb.toString():" + sb.toString());
      long end0 = System.currentTimeMillis();
      System.out.println("BufferedOutputStream elapsed time: " + (end0 - begin0) + " ms");
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      try {
        if (buff != null) {
          buff.close();
        }
        if (outStr != null) {
          outStr.close();
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

2017-03-27 Thread Kumar Vishal
+1
-Regards
Kumar Vishal

On Mar 27, 2017 09:31, "xm_zzc" <441586...@qq.com> wrote:

> Hi, Liang:
>   Thanks for your reply.
>
>
>
>


Re: insert into carbon table failed

2017-03-27 Thread william
I guess the word "node" in "CarbonData launches one job per node to sort
the data at node level and avoid shuffling" may cause some confusion. I think
CarbonData launches one task per executor; here "job" should be "task" and
"node" should be "executor".

Maybe he can try increasing the number of executors to avoid the memory problem.


Carbondata resolve kettle dependencies fail

2017-03-27 Thread william
As the attachment shows, CarbonData fails to resolve the Kettle dependencies. Does
anyone know how to fix this?
Also, if you use an exclusion in Maven to exclude Kettle, your project will fail
even at compile time.


Re: [jira] [Created] (CARBONDATA-824) Null pointer Exception display to user while performance Testing

2017-03-27 Thread william
First, can you try changing the CREATE TABLE statement to end with STORED BY
'carbondata' instead of STORED BY 'org.apache.carbondata.format'?
Second, can you give some sample data instead of trying to upload the 32 GB CSV
file?


[jira] [Created] (CARBONDATA-824) Null pointer Exception display to user while performance Testing

2017-03-27 Thread Vinod Rohilla (JIRA)
Vinod Rohilla created CARBONDATA-824:


 Summary: Null pointer Exception display to user while performance 
Testing
 Key: CARBONDATA-824
 URL: https://issues.apache.org/jira/browse/CARBONDATA-824
 Project: CarbonData
  Issue Type: Bug
  Components: data-query
Affects Versions: 1.0.1-incubating
 Environment: SPARK 2.1
Reporter: Vinod Rohilla


A null pointer exception is displayed to the user during a SELECT query.

Steps to reproduce:

1: Create table:

CREATE TABLE oscon_new_1 (ACTIVE_AREA_ID String, ACTIVE_CHECK_DY String, 
ACTIVE_CHECK_HOUR String, ACTIVE_CHECK_MM String, ACTIVE_CHECK_TIME String, 
ACTIVE_CHECK_YR String, ACTIVE_CITY String, ACTIVE_COUNTRY String, 
ACTIVE_DISTRICT String, ACTIVE_EMUI_VERSION String, ACTIVE_FIRMWARE_VER String, 
ACTIVE_NETWORK String, ACTIVE_OS_VERSION String, ACTIVE_PROVINCE String, BOM 
String, CHECK_DATE String, CHECK_DY String, CHECK_HOUR String, CHECK_MM String, 
CHECK_YR String, CUST_ADDRESS_ID String, CUST_AGE String, CUST_BIRTH_COUNTRY 
String, CUST_BIRTH_DY String, CUST_BIRTH_MM String, CUST_BIRTH_YR String, 
CUST_BUY_POTENTIAL String, CUST_CITY String, CUST_STATE String, CUST_COUNTRY 
String, CUST_COUNTY String, CUST_EMAIL_ADDR String, CUST_LAST_RVW_DATE 
TIMESTAMP, CUST_FIRST_NAME String, CUST_ID String, CUST_JOB_TITLE String, 
CUST_LAST_NAME String, CUST_LOGIN String, CUST_NICK_NAME String, CUST_PRFRD_FLG 
String, CUST_SEX String, CUST_STREET_NAME String, CUST_STREET_NO String, 
CUST_SUITE_NO String, CUST_ZIP String, DELIVERY_CITY String, DELIVERY_STATE 
String, DELIVERY_COUNTRY String, DELIVERY_DISTRICT String, DELIVERY_PROVINCE 
String, DEVICE_NAME String, INSIDE_NAME String, ITM_BRAND String, ITM_BRAND_ID 
String, ITM_CATEGORY String, ITM_CATEGORY_ID String, ITM_CLASS String, 
ITM_CLASS_ID String, ITM_COLOR String, ITM_CONTAINER String, ITM_FORMULATION 
String, ITM_MANAGER_ID String, ITM_MANUFACT String, ITM_MANUFACT_ID String, 
ITM_ID String, ITM_NAME String, ITM_REC_END_DATE String, ITM_REC_START_DATE 
String, LATEST_AREAID String, LATEST_CHECK_DY String, LATEST_CHECK_HOUR String, 
LATEST_CHECK_MM String, LATEST_CHECK_TIME String, LATEST_CHECK_YR String, 
LATEST_CITY String, LATEST_COUNTRY String, LATEST_DISTRICT String, 
LATEST_EMUI_VERSION String, LATEST_FIRMWARE_VER String, LATEST_NETWORK String, 
LATEST_OS_VERSION String, LATEST_PROVINCE String, OL_ORDER_DATE String, 
OL_ORDER_NO INT, OL_RET_ORDER_NO String, OL_RET_DATE String, OL_SITE String, 
OL_SITE_DESC String, PACKING_DATE String, PACKING_DY String, PACKING_HOUR 
String, PACKING_LIST_NO String, PACKING_MM String, PACKING_YR String, 
PRMTION_ID String, PRMTION_NAME String, PRM_CHANNEL_CAT String, 
PRM_CHANNEL_DEMO String, PRM_CHANNEL_DETAILS String, PRM_CHANNEL_DMAIL String, 
PRM_CHANNEL_EMAIL String, PRM_CHANNEL_EVENT String, PRM_CHANNEL_PRESS String, 
PRM_CHANNEL_RADIO String, PRM_CHANNEL_TV String, PRM_DSCNT_ACTIVE String, 
PRM_END_DATE String, PRM_PURPOSE String, PRM_START_DATE String, PRODUCT_ID 
String, PROD_BAR_CODE String, PROD_BRAND_NAME String, PRODUCT_NAME String, 
PRODUCT_MODEL String, PROD_MODEL_ID String, PROD_COLOR String, PROD_SHELL_COLOR 
String, PROD_CPU_CLOCK String, PROD_IMAGE String, PROD_LIVE String, PROD_LOC 
String, PROD_LONG_DESC String, PROD_RAM String, PROD_ROM String, PROD_SERIES 
String, PROD_SHORT_DESC String, PROD_THUMB String, PROD_UNQ_DEVICE_ADDR String, 
PROD_UNQ_MDL_ID String, PROD_UPDATE_DATE String, PROD_UQ_UUID String, 
SHP_CARRIER String, SHP_CODE String, SHP_CONTRACT String, SHP_MODE_ID String, 
SHP_MODE String, STR_ORDER_DATE String, STR_ORDER_NO String, TRACKING_NO 
String, WH_CITY String, WH_COUNTRY String, WH_COUNTY String, WH_ID String, 
WH_NAME String, WH_STATE String, WH_STREET_NAME String, WH_STREET_NO String, 
WH_STREET_TYPE String, WH_SUITE_NO String, WH_ZIP String, CUST_DEP_COUNT 
DOUBLE, CUST_VEHICLE_COUNT DOUBLE, CUST_ADDRESS_CNT DOUBLE, CUST_CRNT_CDEMO_CNT 
DOUBLE, CUST_CRNT_HDEMO_CNT DOUBLE, CUST_CRNT_ADDR_DM DOUBLE, 
CUST_FIRST_SHIPTO_CNT DOUBLE, CUST_FIRST_SALES_CNT DOUBLE, CUST_GMT_OFFSET 
DOUBLE, CUST_DEMO_CNT DOUBLE, CUST_INCOME DOUBLE, PROD_UNLIMITED INT, 
PROD_OFF_PRICE DOUBLE, PROD_UNITS INT, TOTAL_PRD_COST DOUBLE, TOTAL_PRD_DISC 
DOUBLE, PROD_WEIGHT DOUBLE, REG_UNIT_PRICE DOUBLE, EXTENDED_AMT DOUBLE, 
UNIT_PRICE_DSCNT_PCT DOUBLE, DSCNT_AMT DOUBLE, PROD_STD_CST DOUBLE, 
TOTAL_TX_AMT DOUBLE, FREIGHT_CHRG DOUBLE, WAITING_PERIOD DOUBLE, 
DELIVERY_PERIOD DOUBLE, ITM_CRNT_PRICE DOUBLE, ITM_UNITS DOUBLE, ITM_WSLE_CST 
DOUBLE, ITM_SIZE DOUBLE, PRM_CST DOUBLE, PRM_RESPONSE_TARGET DOUBLE, PRM_ITM_DM 
DOUBLE, SHP_MODE_CNT DOUBLE, WH_GMT_OFFSET DOUBLE, WH_SQ_FT DOUBLE, STR_ORD_QTY 
DOUBLE, STR_WSLE_CST DOUBLE, STR_LIST_PRICE DOUBLE, STR_SALES_PRICE DOUBLE, 
STR_EXT_DSCNT_AMT DOUBLE, STR_EXT_SALES_PRICE DOUBLE, STR_EXT_WSLE_CST DOUBLE, 
STR_EXT_LIST_PRICE DOUBLE, STR_EXT_TX DOUBLE, STR_COUPON_AMT DOUBLE, 
STR_NET_PAID