Re: spark sql - group by constant column

2015-07-15 Thread Lior Chaga
I found the problem. Grouping by a constant column value is indeed
impossible.
The reason it appeared to work in my project is that I gave the constant
column an alias that already exists in the schema of the dataframe. The
dataframe contained a data_timestamp column representing an hour, and I
added to the select a constant data_timestamp representing the timestamp
of the day. That was also the cause of my original bug: I thought I was
grouping by the day timestamp, when I was actually grouping by each hour,
and therefore got multiple rows for each group by combination.
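
To illustrate (the table and column names are reused from the example query
below, and the literal value is made up), the following select looks like it
groups by a daily constant, but because the alias collides with the existing
hourly data_timestamp column, the GROUP BY resolves to that column instead of
the constant:

SELECT primary_one, 1436911200 as data_timestamp, SUM(measure) as total
FROM tbl
GROUP BY primary_one, data_timestamp

The result is one row per (primary_one, hourly data_timestamp) rather than
one row per primary_one for the day, which is exactly the extra-rows symptom
described above.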

On Wed, Jul 15, 2015 at 10:09 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

 Facing a bug with group by in Spark SQL (version 1.4).
 Registered a JavaRDD of objects containing integer fields as a table.

 Then I'm trying to do a group by, with a constant value in the group by
 fields:

 SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
 FROM tbl
 GROUP BY primary_one, primary_two, num


 I get the following exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
 columns measure, primary_one, primary_two
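
 For comparison, a variant that should resolve (a sketch, assuming Spark SQL
 1.4 accepts a constant select expression that is not listed in the GROUP BY)
 keeps the literal out of the grouping clause:

 SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
 FROM tbl
 GROUP BY primary_one, primary_two

 Since the constant is the same for every row, dropping it from the GROUP BY
 does not change the grouping.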

 Tried with both HiveContext and SQLContext.
 The odd thing is that this kind of query actually works for me in a
 project I'm working on, but there I have another bug (the group by does
 not yield the expected results).

 The only explanation I can think of is that the context configuration in
 my real project is different.
 In the example above, the HiveContext configuration is empty.

 In my real project, the configuration is shown below.
 Any ideas?

 Thanks,
 Lior

 Hive context configuration in project:
 (mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)
 (nfs3.mountd.port,4242)
 (mapreduce.tasktracker.healthchecker.script.timeout,60)
 (yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)
 (mapreduce.input.fileinputformat.input.dir.recursive,false)
 (hive.orc.compute.splits.num.threads,10)
 (mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)
 (hive.auto.convert.sortmerge.join.to.mapjoin,false)
 (hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)
 (hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)
 (hive.mapjoin.lazy.hashtable,true)
 (mapreduce.framework.name,local)
 (hive.exec.script.maxerrsize,10)
 (dfs.namenode.checkpoint.txns,100)
 (tfile.fs.output.buffer.size,262144)
 (yarn.app.mapreduce.am.job.task.listener.thread-count,30)
 (mapreduce.tasktracker.local.dir.minspacekill,0)
 (hive.support.concurrency,false)
 (fs.s3.block.size,67108864)
 (hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)
 (hive.stats.retries.max,0)
 (hadoop.hdfs.configuration.version,1)
 (dfs.bytes-per-checksum,512)
 (fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)
 (mapreduce.job.acl-view-job, )
 (hive.typecheck.on.insert,true)
 (mapreduce.jobhistory.loadedjobs.cache.size,5)
 (mapreduce.jobtracker.persist.jobstatus.hours,1)
 (hive.unlock.numretries,10)
 (dfs.namenode.handler.count,10)
 (mapreduce.input.fileinputformat.split.minsize,1)
 (hive.plan.serialization.format,kryo)
 (dfs.datanode.failed.volumes.tolerated,0)
 (yarn.resourcemanager.container.liveness-monitor.interval-ms,60)
 (yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)
 (yarn.resourcemanager.client.thread-count,50)
 (io.seqfile.compress.blocksize,100)
 (mapreduce.tasktracker.http.threads,40)
 (hive.explain.dependency.append.tasktype,false)
 (dfs.namenode.retrycache.expirytime.millis,60)
 (dfs.namenode.backup.address,0.0.0.0:50100)
 (hive.hwi.listen.host,0.0.0.0)
 (dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)
 (dfs.replication,3)
 (mapreduce.jobtracker.jobhistory.block.size,3145728)
 (dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})
 (mapreduce.task.profile.maps,0-2)
 (fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)
 (hive.stats.reliable,false)
 (yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)



