I found the problem. Grouping by a constant column value is indeed impossible. The reason it "worked" in my project is that I gave the constant column an alias that already exists in the dataframe's schema. The dataframe contained a "data_timestamp" column representing an hour, and I added to the select a constant "data_timestamp" representing the timestamp of the day. That was the cause of my original bug: the "data_timestamp" in the GROUP BY resolved to the hourly column, so I thought I was grouping by the day timestamp when I was actually grouping by each hour, and therefore got multiple rows for each group-by combination.
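For anyone who hits the same thing, here is a minimal sketch of the pitfall (the constant value is illustrative; the column and table names are from the project and query described below):

SELECT primary_one, primary_two,
       1436832000 as data_timestamp, -- constant day timestamp, but "data_timestamp" is also an hourly column in tbl
       SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, data_timestamp

The "data_timestamp" in the GROUP BY resolves to the hourly column from the table schema, not to the constant alias, so the query silently groups per hour instead of per day. With a non-colliding alias, like "num" in the query below, the analyzer instead fails fast with the "cannot resolve" error.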
On Wed, Jul 15, 2015 at 10:09 AM, Lior Chaga <lio...@taboola.com> wrote:
> Hi,
>
> Facing a bug with group by in SparkSQL (version 1.4).
> Registered a JavaRDD with an object containing integer fields as a table.
>
> Then I'm trying to do a group by, with a constant value in the group by fields:
>
> SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
> FROM tbl
> GROUP BY primary_one, primary_two, num
>
> I get the following exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input columns measure, primary_one, primary_two
>
> Tried both with HiveContext and SqlContext.
> The odd thing is that this kind of query actually works for me in a project I'm working on, but I have another bug there (the group by does not yield the expected results).
>
> The only reason I can think of is that maybe in my real project the context configuration is different.
> In my example above, the configuration of the HiveContext is empty.
>
> In my real project, the configuration is shown below.
> Any ideas?
>
> Thanks,
> Lior
>
> Hive context configuration in project:
> "(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
> "(nfs3.mountd.port,4242)"
> "(mapreduce.tasktracker.healthchecker.script.timeout,600000)"
> "(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
> "(mapreduce.input.fileinputformat.input.dir.recursive,false)"
> "(hive.orc.compute.splits.num.threads,10)"
> "(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
> "(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
> "(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
> "(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
> "(hive.mapjoin.lazy.hashtable,true)"
> "(mapreduce.framework.name,local)"
> "(hive.exec.script.maxerrsize,100000)"
> "(dfs.namenode.checkpoint.txns,1000000)"
> "(tfile.fs.output.buffer.size,262144)"
> "(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
> "(mapreduce.tasktracker.local.dir.minspacekill,0)"
> "(hive.support.concurrency,false)"
> "(fs.s3.block.size,67108864)"
> "(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
> "(hive.stats.retries.max,0)"
> "(hadoop.hdfs.configuration.version,1)"
> "(dfs.bytes-per-checksum,512)"
> "(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
> "(mapreduce.job.acl-view-job, )"
> "(hive.typecheck.on.insert,true)"
> "(mapreduce.jobhistory.loadedjobs.cache.size,5)"
> "(mapreduce.jobtracker.persist.jobstatus.hours,1)"
> "(hive.unlock.numretries,10)"
> "(dfs.namenode.handler.count,10)"
> "(mapreduce.input.fileinputformat.split.minsize,1)"
> "(hive.plan.serialization.format,kryo)"
> "(dfs.datanode.failed.volumes.tolerated,0)"
> "(yarn.resourcemanager.container.liveness-monitor.interval-ms,600000)"
> "(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
> "(yarn.resourcemanager.client.thread-count,50)"
> "(io.seqfile.compress.blocksize,1000000)"
> "(mapreduce.tasktracker.http.threads,40)"
> "(hive.explain.dependency.append.tasktype,false)"
> "(dfs.namenode.retrycache.expirytime.millis,600000)"
> "(dfs.namenode.backup.address,0.0.0.0:50100)"
> "(hive.hwi.listen.host,0.0.0.0)"
> "(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
> "(dfs.replication,3)"
> "(mapreduce.jobtracker.jobhistory.block.size,3145728)"
> "(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
> "(mapreduce.task.profile.maps,0-2)"
> "(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
> "(hive.stats.reliable,false)"
> "(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"