Re: how to use the sql join in java please

2018-04-11 Thread Yu, Yucai
Do you really want to do a cartesian product on those two tables?
If yes, you can set spark.sql.crossJoin.enabled=true.
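For example, a minimal sketch in Java (not from the original mail; the two Dataset variables are the JDBC DataFrames from the code quoted below, and "project_code" is only an illustrative column name, not taken from the real tables):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Option 1: explicitly allow the Cartesian product (every row paired with every row).
SparkSession ss = SparkSession.builder()
        .master("local[4]")
        .appName("testSql")
        .config("spark.sql.crossJoin.enabled", "true")
        .getOrCreate();
Dataset<Row> cartesian = t_pro_ware_partner_rela.join(data_busi_hour);
// On Spark 2.1+, crossJoin() states the same intent without changing the config:
// Dataset<Row> cartesian = t_pro_ware_partner_rela.crossJoin(data_busi_hour);

// Option 2 (usually what you want): supply a join condition so no Cartesian product is needed.
Dataset<Row> joined = t_pro_ware_partner_rela.join(
        data_busi_hour,
        t_pro_ware_partner_rela.col("project_code")
                .equalTo(data_busi_hour.col("project_code")));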

Thanks,
Yucai

From: "1427357...@qq.com" <1427357...@qq.com>
Date: Wednesday, April 11, 2018 at 3:16 PM
To: spark users
Subject: how to use the sql join in java please

Hi all,

I am writing Java code to join two tables.
My code looks like:


SparkSession ss = SparkSession.builder().master("local[4]").appName("testSql").getOrCreate();

Properties properties = new Properties();
properties.put("user", "A");
properties.put("password", "B");
String url = "jdbc:mysql://xxx:/xxx?useUnicode=true&characterEncoding=gbk&zeroDateTimeBehavior=convertToNull&serverTimezone=UTC";

Dataset<Row> data_busi_hour = ss.read().jdbc(url, "A", properties);
data_busi_hour.show();
//newemployee.printSchema();

Dataset<Row> t_pro_ware_partner_rela = ss.read().jdbc(url, "B", properties);

// Note: no join condition is given here, so Spark plans a key-less inner join.
Dataset<Row> newX = t_pro_ware_partner_rela.join(data_busi_hour);
newX.show();

I get an error like the one below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Relation[ XXX   FIRST_ORG_ARCHNAME#80,... 11 more fields] JDBCRelation(t_pro_ware_partner_rela) [numPartitions=1]
and
Relation[id#0L,project_code#1,project_name#2] JDBCRelation(data_busi_hour) [numPartitions=1]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(Quer

RE: Can we disable parquet logs in Spark?

2016-10-21 Thread Yu, Yucai
I set "log4j.rootCategory=ERROR, console" and used "-file conf/log4f.properties" to suppress most of the logs, but the org.apache.parquet logs still show up.

Any way to disable them also?
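The "Oct 21, 2016 2:27:30 PM INFO:" prefix looks like java.util.logging output rather than log4j, which would explain why log4j.properties has no effect on these messages. If that is the case, one possible workaround (a sketch under that assumption, not a confirmed fix; the class name is hypothetical) is to raise the java.util.logging level for the org.apache.parquet logger on every JVM that actually writes the Parquet files:

import java.util.logging.Level;
import java.util.logging.Logger;

public final class SilenceParquetJul {
    // Hold a static reference: JUL keeps loggers weakly, so a level set on a
    // temporary Logger instance can be lost once that instance is garbage collected.
    private static final Logger PARQUET_JUL_LOGGER = Logger.getLogger("org.apache.parquet");

    public static void apply() {
        // Drop INFO/WARNING records such as the ColumnChunkPageWriteStore messages below.
        PARQUET_JUL_LOGGER.setLevel(Level.SEVERE);
    }
}

Since the messages appear in the YARN container stdout, the call would need to run on the executors (for example early in the tasks that do the writing), not only in the driver.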

Thanks,
Yucai

From: Yu, Yucai [mailto:yucai...@intel.com]
Sent: Friday, October 21, 2016 2:50 PM
To: user@spark.apache.org
Subject: Can we disable parquet logs in Spark?

Hi,

I see lots of parquet logs in the container logs (YARN mode), like the ones below:

stdout:
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 8,448B for 
[ss_promo_sk] INT32: 5,996 values, 8,513B raw, 8,409B comp, 1 pages, encodings: 
[PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 1,475 entries, 5,900B raw, 1,475B 
comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 1,376B for 
[ss_ticket_number] INT32: 5,996 values, 1,730B raw, 1,340B comp, 1 pages, 
encodings: [PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 524 entries, 2,096B raw, 
524B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 5,516B for 
[ss_quantity] INT32: 5,996 values, 5,567B raw, 5,479B comp, 1 pages, encodings: 
[PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 100 entries, 400B raw, 100B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 14,385B for 
[ss_wholesale_cost] INT32: 5,996 values, 23,931B raw, 14,346B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 15,043B for 
[ss_list_price] INT32: 5,996 values, 23,871B raw, 15,004B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 14,442B for 
[ss_sales_price] INT32: 5,996 values, 23,896B raw, 14,403B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 3,538B for 
[ss_ext_discount_amt] INT32: 5,996 values, 7,317B raw, 3,501B comp, 1 pages, 
encodings: [PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 1,139 entries, 4,556B 
raw, 1,139B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 18,052B for 
[ss_ext_sales_price] INT32: 5,996 values, 23,907B raw, 18,013B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oc

I tried the settings below in log4j.properties, but they did not work.
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

Is there a way to disable them?

Thanks a lot!

Yucai


Can we disable parquet logs in Spark?

2016-10-20 Thread Yu, Yucai
Hi,

I see lots of parquet logs in the container logs (YARN mode), like the ones below:

stdout:
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 8,448B for 
[ss_promo_sk] INT32: 5,996 values, 8,513B raw, 8,409B comp, 1 pages, encodings: 
[PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 1,475 entries, 5,900B raw, 1,475B 
comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 1,376B for 
[ss_ticket_number] INT32: 5,996 values, 1,730B raw, 1,340B comp, 1 pages, 
encodings: [PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 524 entries, 2,096B raw, 
524B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 5,516B for 
[ss_quantity] INT32: 5,996 values, 5,567B raw, 5,479B comp, 1 pages, encodings: 
[PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 100 entries, 400B raw, 100B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 14,385B for 
[ss_wholesale_cost] INT32: 5,996 values, 23,931B raw, 14,346B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 15,043B for 
[ss_list_price] INT32: 5,996 values, 23,871B raw, 15,004B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 14,442B for 
[ss_sales_price] INT32: 5,996 values, 23,896B raw, 14,403B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 3,538B for 
[ss_ext_discount_amt] INT32: 5,996 values, 7,317B raw, 3,501B comp, 1 pages, 
encodings: [PLAIN_DICTIONARY, BIT_PACKED, RLE], dic { 1,139 entries, 4,556B 
raw, 1,139B comp}
Oct 21, 2016 2:27:30 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 18,052B for 
[ss_ext_sales_price] INT32: 5,996 values, 23,907B raw, 18,013B comp, 1 pages, 
encodings: [BIT_PACKED, PLAIN, RLE]
Oc

I tried the settings below in log4j.properties, but they did not work.
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

Is there a way to disable them?

Thanks a lot!

Yucai


RE: Re: build/sbt gen-idea error

2016-04-13 Thread Yu, Yucai
Reminder: gen-idea has been removed from master. See:

commit a172e11cba6f917baf5bd6c4f83dc6689932de9a
Author: Luciano Resende 
Date:   Mon Apr 4 16:55:59 2016 -0700

[SPARK-14366] Remove sbt-idea plugin

## What changes were proposed in this pull request?

Remove sbt-idea plugin as importing sbt project provides much better 
support.

Author: Luciano Resende 

Closes #12151 from lresende/SPARK-14366.


From: ImMr.K [mailto:875061...@qq.com]
Sent: Wednesday, April 13, 2016 9:48 PM
To: Ted Yu 
Cc: user 
Subject: Re: build/sbt gen-idea error

Actually, the same error occurred when I ran build/sbt compile or other commands. After struggling with it for some time, I remembered that I use a proxy to connect to the Internet. Once I set the proxy for Maven, everything worked fine. Just a reminder for those who use proxies.

--
Best regards,
Ze Jin



-- Original Message --
From: "Ted Yu" <yuzhih...@gmail.com>;
Sent: Tuesday, April 12, 2016, 11:38 PM
To: "ImMr.K" <875061...@qq.com>;
Cc: "user" <user@spark.apache.org>;
Subject: Re: build/sbt gen-idea error

gen-idea doesn't seem to be a valid command:

[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea

On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K 
<875061...@qq.com> wrote:
Hi,
I have cloned spark and ran:
cd spark
build/sbt gen-idea

and got the following output:


Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/king/github/spark/project/project
[info] Loading project definition from 
/home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with 
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project 
resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /home/king/github/spark/project
org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the effective model for org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: Connection timed out from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 'parent.relativePath' points at wrong local POM @ line 22, column 11

at org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:446)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
at sbt.BuildLoader.apply(BuildLoader.scala:140)
at sbt.Load$.loadAll(Load.scala:344)
at sbt.Load$.loadURI(Load.scala:299)
at sbt.Load$.load(Load.scala:295)
at sbt.Load$.load(Load.scala:286)
at sbt.Load$.apply(Load.scala:140)
at sbt.Load$.defaultLoad(Load.scala:36)
at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:6

RE: Unable run Spark in YARN mode

2016-04-10 Thread Yu, Yucai
Could you follow this guide 
http://spark.apache.org/docs/latest/running-on-yarn.html#configuration?

Thanks,
Yucai

-Original Message-
From: maheshmath [mailto:mahesh.m...@gmail.com] 
Sent: Saturday, April 9, 2016 1:58 PM
To: user@spark.apache.org
Subject: Unable run Spark in YARN mode

I have set SPARK_LOCAL_IP=127.0.0.1 but am still getting the error below:

16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh
16/04/09 10:36:50 INFO spark.SecurityManager: Changing modify acls to:
mahesh
16/04/09 10:36:50 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(mahesh); users with modify permissions: Set(mahesh)
16/04/09 10:36:51 INFO util.Utils: Successfully started service
'sparkDriver' on port 43948.
16/04/09 10:36:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/04/09 10:36:51 INFO Remoting: Starting remoting
16/04/09 10:36:52 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriverActorSystem@127.0.0.1:32792]
16/04/09 10:36:52 INFO util.Utils: Successfully started service
'sparkDriverActorSystem' on port 32792.
16/04/09 10:36:52 INFO spark.SparkEnv: Registering MapOutputTracker
16/04/09 10:36:52 INFO spark.SparkEnv: Registering BlockManagerMaster
16/04/09 10:36:52 INFO storage.DiskBlockManager: Created local directory at
/tmp/blockmgr-a2079037-6bbe-49ce-ba78-d475e38ad362
16/04/09 10:36:52 INFO storage.MemoryStore: MemoryStore started with
capacity 517.4 MB
16/04/09 10:36:52 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/04/09 10:36:53 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/04/09 10:36:53 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
16/04/09 10:36:53 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
16/04/09 10:36:53 INFO ui.SparkUI: Started SparkUI at http://127.0.0.1:4040
16/04/09 10:36:53 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
16/04/09 10:36:54 INFO yarn.Client: Requesting a new application from
cluster with 1 NodeManagers
16/04/09 10:36:54 INFO yarn.Client: Verifying our application has not
requested more than the maximum memory capability of the cluster (8192 MB
per container)
16/04/09 10:36:54 INFO yarn.Client: Will allocate AM container, with 896 MB
memory including 384 MB overhead
16/04/09 10:36:54 INFO yarn.Client: Setting up container launch context for
our AM
16/04/09 10:36:54 INFO yarn.Client: Setting up the launch environment for
our AM container
16/04/09 10:36:54 INFO yarn.Client: Preparing resources for our AM container
16/04/09 10:36:56 INFO yarn.Client: Uploading resource
file:/home/mahesh/Programs/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
->
hdfs://localhost:54310/user/mahesh/.sparkStaging/application_1460137661144_0003/spark-assembly-1.6.1-hadoop2.6.0.jar
16/04/09 10:36:59 INFO yarn.Client: Uploading resource
file:/tmp/spark-f28e3fd5-4dcd-4199-b298-c7fc607dedb4/__spark_conf__5551799952710555772.zip
->
hdfs://localhost:54310/user/mahesh/.sparkStaging/application_1460137661144_0003/__spark_conf__5551799952710555772.zip
16/04/09 10:36:59 INFO spark.SecurityManager: Changing view acls to: mahesh
16/04/09 10:36:59 INFO spark.SecurityManager: Changing modify acls to:
mahesh
16/04/09 10:36:59 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(mahesh); users with modify permissions: Set(mahesh)
16/04/09 10:36:59 INFO yarn.Client: Submitting application 3 to
ResourceManager
16/04/09 10:36:59 INFO impl.YarnClientImpl: Submitted application
application_1460137661144_0003
16/04/09 10:37:00 INFO yarn.Client: Application report for
application_1460137661144_0003 (state: ACCEPTED)
16/04/09 10:37:00 INFO yarn.Client: 
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1460178419692
 final status: UNDEFINED
 tracking URL: http://gubbi:8088/proxy/application_1460137661144_0003/
 user: mahesh
16/04/09 10:37:01 INFO yarn.Client: Application report for
application_1460137661144_0003 (state: ACCEPTED)
16/04/09 10:37:02 INFO yarn.Client: Application report for
application_1460137661144_0003 (state: ACCEPTED)
16/04/09 10:37:03 INFO yarn.Client: Application report for
application_1460137661144_0003 (state: ACCEPTED)
16/04/09 10:37:04 INFO yarn.Client: Application report for
application_1460137661144_0003 (state: ACCEPTED)
16/04/09 10:37:05 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/09 10:37:05 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter.
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS
-> gubbi, PROXY_URI_BASES ->
http://gubbi:8088/proxy/application_1460137661144_0003),
/proxy/application_1460137661144_0003
16/04/09 10:37:05 INFO ui.JettyUtils: Addin

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-27 Thread Yu, Yucai
As per this document: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS),
 Hive CTAS has the restriction that the target table cannot be a partitioned 
table.
Tejas also pointed out that you do not need to specify the column information, as it is derived from the result of the SELECT statement, so the correct way to write a CTAS is like Hive's CTAS example:

CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

It works well in Spark/beeline also.

Thanks,
Yucai

From: Younes Naguib [mailto:younes.nag...@tritondigital.com]
Sent: Wednesday, January 27, 2016 2:33 AM
To: 'Tejas Patil' ; 'yuzhih...@gmail.com' 

Cc: user@spark.apache.org
Subject: RE: ctas fails with "No plan for CreateTableAsSelect"

It seems that for partitioned tables, you need to create the table first and then run an INSERT INTO ... SELECT to take advantage of dynamic partition allocation (see the sketch below). That worked for me.
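A hedged sketch of that create-then-insert approach (the table, column, and source names are illustrative, not taken from the failing job; the SET statements are the usual Hive switches for dynamic partitioning):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE tab1_parquet (
  col2 STRING,
  col3 INT
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;

-- Partition columns go last in the SELECT so they are assigned dynamically.
INSERT INTO TABLE tab1_parquet PARTITION (year, month, day)
SELECT col2, col3, year, month, day
FROM source_table;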

@Ted I just realized you were asking for a complete stack trace.
2016-01-26 15:36:04 ERROR SparkExecuteStatementOperation:95 - Error executing query, currentState RUNNING,
java.lang.AssertionError: assertion failed: No plan for CreateTableAsSelect HiveTable(Some(default),tab1, ArrayBuffer(col1 ,timestamp,null), HiveColumn(col2,string,null), HiveColumn(col3,int,null), HiveColumn(col4,int,null) HiveColumn(overflow,array,null)),ArrayBuffer(HiveColumn(year,int,null), HiveColumn(month,int,null), HiveColumn(day,int,null)),Map(),Map(),ManagedTable,Some(hdfs://localhost:9000/tab1),Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat),Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat),Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe),None), false
{Huge explain plan}

at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Thanks for all the suggestions.

From: Younes Naguib [mailto:younes.nag...@tritondigital.com]
Sent: January-26-16 11:42 AM
To: 'Tejas Patil'
Cc: user@spark.apache.org
Subject: RE: ctas fails with "No plan for CreateTableAsSelect"

The destination table is partitioned. If I don’t specify the columns, I get:
Error: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Partition column name year conflicts with table columns. (state=,code=0)

younes

From: Tejas Patil [mailto:tejas.patil...@gmail.com]
Sent: January-26-16 11:39 AM
To: Younes Naguib
Cc: user@spark.apache.org
Subject: Re: ctas fails with "No plan for CreateTableAsSelect"

In CTAS, you should no