I found the problem. Grouping by a constant column value is indeed impossible. The reason it "worked" in my project is that I gave the constant column an alias that already exists in the dataframe's schema. The dataframe contained a "data_timestamp" column representing an hour, and I added to the select a constant "data_timestamp" representing the timestamp of the day. That was the cause of my original bug: the "data_timestamp" in the GROUP BY resolved to the hourly column, so I thought I was grouping by the day timestamp when I was actually grouping by each hour, and therefore got multiple rows for each group-by combination.
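For anyone who hits the same thing, here is a minimal sketch of the pitfall (the constant value is illustrative; the column and table names are from the project and query described below):

SELECT primary_one, primary_two,
       1436832000 as data_timestamp, -- constant day timestamp, but "data_timestamp" is also an hourly column in tbl
       SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, data_timestamp

The "data_timestamp" in the GROUP BY resolves to the hourly column from the table schema, not to the constant alias, so the query silently groups per hour instead of per day. With a non-colliding alias, like "num" in the query below, the analyzer instead fails fast with the "cannot resolve" error.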
On Wed, Jul 15, 2015 at 10:09 AM, Lior Chaga <lio...@taboola.com> wrote:
> Hi,
>
> Facing a bug with group by in SparkSQL (version 1.4).
> Registered a JavaRDD with an object containing integer fields as a table.
>
> Then I'm trying to do a group by, with a constant value in the group by fields:
>
> SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
> FROM tbl
> GROUP BY primary_one, primary_two, num
>
> I get the following exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input columns measure, primary_one, primary_two
>
> Tried both with HiveContext and SqlContext.
> The odd thing is that this kind of query actually works for me in a project I'm working on, but I have another bug there (the group by does not yield the expected results).
>
> The only reason I can think of is that maybe in my real project the context configuration is different.
> In my example above, the configuration of the HiveContext is empty.
>
> In my real project, the configuration is shown below.
> Any ideas?
>
> Thanks,
> Lior
>
> Hive context configuration in project:
> "(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
> "(nfs3.mountd.port,4242)"
> "(mapreduce.tasktracker.healthchecker.script.timeout,600000)"
> "(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
> "(mapreduce.input.fileinputformat.input.dir.recursive,false)"
> "(hive.orc.compute.splits.num.threads,10)"
> "(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
> "(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
> "(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
> "(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
> "(hive.mapjoin.lazy.hashtable,true)"
> "(mapreduce.framework.name,local)"
> "(hive.exec.script.maxerrsize,100000)"
> "(dfs.namenode.checkpoint.txns,1000000)"
> "(tfile.fs.output.buffer.size,262144)"
> "(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
> "(mapreduce.tasktracker.local.dir.minspacekill,0)"
> "(hive.support.concurrency,false)"
> "(fs.s3.block.size,67108864)"
> "(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
> "(hive.stats.retries.max,0)"
> "(hadoop.hdfs.configuration.version,1)"
> "(dfs.bytes-per-checksum,512)"
> "(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
> "(mapreduce.job.acl-view-job, )"
> "(hive.typecheck.on.insert,true)"
> "(mapreduce.jobhistory.loadedjobs.cache.size,5)"
> "(mapreduce.jobtracker.persist.jobstatus.hours,1)"
> "(hive.unlock.numretries,10)"
> "(dfs.namenode.handler.count,10)"
> "(mapreduce.input.fileinputformat.split.minsize,1)"
> "(hive.plan.serialization.format,kryo)"
> "(dfs.datanode.failed.volumes.tolerated,0)"
> "(yarn.resourcemanager.container.liveness-monitor.interval-ms,600000)"
> "(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
> "(yarn.resourcemanager.client.thread-count,50)"
> "(io.seqfile.compress.blocksize,1000000)"
> "(mapreduce.tasktracker.http.threads,40)"
> "(hive.explain.dependency.append.tasktype,false)"
> "(dfs.namenode.retrycache.expirytime.millis,600000)"
> "(dfs.namenode.backup.address,0.0.0.0:50100)"
> "(hive.hwi.listen.host,0.0.0.0)"
> "(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
> "(dfs.replication,3)"
> "(mapreduce.jobtracker.jobhistory.block.size,3145728)"
> "(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
> "(mapreduce.task.profile.maps,0-2)"
> "(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
> "(hive.stats.reliable,false)"
> "(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"