Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-21 Thread Martin Gammelsæter
Aha, that makes sense. Thanks for the response! I guess error
messages are one of the areas where Spark could use some love (:

On Fri, Jul 18, 2014 at 9:41 PM, Michael Armbrust
mich...@databricks.com wrote:
 Sorry for the non-obvious error message.  It is not valid SQL to include
 attributes in the select clause unless they are also in the group by clause
 or are inside of an aggregate function.
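 
 For illustration, a minimal sketch of that rule (the table and column names
 below are made up, not taken from your code):
 
 // Fails with the AttributeReference error: id is selected but is neither
 // in the GROUP BY clause nor inside an aggregate function.
 sql("SELECT id, navn FROM dokumenter GROUP BY navn")
 
 // Fine: every selected attribute is either grouped on or aggregated.
 sql("SELECT navn, COUNT(id) FROM dokumenter GROUP BY navn")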

 On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com
 wrote:

 Hi again!

 I am having problems when using GROUP BY on both SQLContext and
 HiveContext (same problem).

 My code (simplified as much as possible) can be seen here:
 http://pastebin.com/33rjW67H

 In short, I'm getting data from a Cassandra store with Datastax' new
 driver (which works great by the way, recommended!), and mapping it to
 a Spark SQL table through a Product class (Dokument in the source).
 Regular SELECTs and the like work fine, but once I try to do a GROUP BY,
 I get the following error:

 Exception in thread "main" org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 0.0:25 failed 4 times, most recent
 failure: Exception failure in TID 63 on host 192.168.121.132:
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No
 function to evaluate expression. type: AttributeReference, tree: id#0

 org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)

 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64)

 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:195)

 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:174)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)

 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
 org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)

 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)

 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)

 What am I doing wrong?

 --
 Best regards,
 Martin Gammelsæter



-- 
Mvh.
Martin Gammelsæter
92209139


TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Martin Gammelsæter
Hi again!

I am having problems when using GROUP BY on both SQLContext and
HiveContext (same problem).

My code (simplified as much as possible) can be seen here:
http://pastebin.com/33rjW67H

In short, I'm getting data from a Cassandra store with Datastax' new
driver (which works great by the way, recommended!), and mapping it to
a Spark SQL table through a Product class (Dokument in the source).
Regular SELECTs and the like work fine, but once I try to do a GROUP BY,
I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 0.0:25 failed 4 times, most recent
failure: Exception failure in TID 63 on host 192.168.121.132:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No
function to evaluate expression. type: AttributeReference, tree: id#0

org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)

org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:195)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:174)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

What am I doing wrong?

-- 
Best regards,
Martin Gammelsæter


Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Martin Gammelsæter
I am very interested in the original question as well, is there any
list (even if it is simply in the code) of all supported syntax for
Spark SQL?

On Mon, Jul 14, 2014 at 6:41 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Are you sure the code running on the cluster has been updated?

 I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m
 assuming that’s taken care of, at least in theory.

 I just spun down the clusters I had up, but I will revisit this tomorrow and
 provide the information you requested.

 Nick



-- 
Mvh.
Martin Gammelsæter
92209139


Re: Disabling SparkContext WebUI on port 4040, accessing information programmatically?

2014-07-09 Thread Martin Gammelsæter
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem
to work. For now we've downgraded dropwizard to 0.6.2 which uses a
compatible version of jetty. Not optimal, but it works for now.

On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote:
 We're doing a similar thing to launch a Spark job in Tomcat, and I opened a
 JIRA for this. There are a couple of technical discussions there.

 https://issues.apache.org/jira/browse/SPARK-2100

 In the end, we realized that Spark uses Jetty not only for the Spark
 WebUI but also for distributing jars and tasks, so it is really hard
 to remove the web dependency from Spark.

 In the end, we launch our Spark job in yarn-cluster mode, and at
 runtime the only dependency in our web application is spark-yarn,
 which doesn't contain any of the Spark web stuff.

 PS: upgrading Jetty from 8.x to 9.x in Spark may not be as
 straightforward as just changing the version in the Spark build script.
 Jetty 9.x requires Java 7, since the servlet API it targets (Servlet 3.1)
 requires Java 7.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers ko...@tresata.com wrote:
 do you control your cluster and spark deployment? if so, you can try to
 rebuild with jetty 9.x


 On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
 martingammelsae...@gmail.com wrote:

 Digging a bit more I see that there is yet another jetty instance that
 is causing the problem, namely the BroadcastManager has one. I guess
 this one isn't very wise to disable... It might very well be that the
 WebUI is a problem as well, but I guess the code doesn't get far
 enough. Any ideas on how to solve this? Spark seems to use jetty
 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
 of the problem. Any ideas?

 On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
 martingammelsae...@gmail.com wrote:
  Hi!
 
  I am building a web frontend for a Spark app, allowing users to input
  sql/hql and get results back. When starting a SparkContext from within
  my server code (using jetty/dropwizard) I get the error
 
  java.lang.NoSuchMethodError:
  org.eclipse.jetty.server.AbstractConnector: method init()V not found
 
  when Spark tries to fire up its own jetty server. This does not happen
  when running the same code without my web server. This is probably
  fixable somehow(?) but I'd like to disable the webUI as I don't need
  it, and ideally I would like to access that information
  programmatically instead, allowing me to embed it in my own web
  application.
 
  Is this possible?
 
  --
  Best regards,
  Martin Gammelsæter



 --
 Mvh.
 Martin Gammelsæter
 92209139





-- 
Mvh.
Martin Gammelsæter
92209139


"Initial job has not accepted any resources" means many things

2014-07-09 Thread Martin Gammelsæter
It seems like the "Initial job has not accepted any resources" message
shows up for a wide variety of different errors: for example the obvious
one where you've requested more memory than is available, but also, for
example, the case where the worker nodes do not have the appropriate code
on their classpath. Debugging from this error is very hard, as the errors
do not show up in the logs on the workers. Is this a known issue? I'm also
having trouble getting the code to the workers without using addJar (my
code is a fairly static application, and I'd like to avoid calling addJar
every time the app starts up and instead manually add the jar to the
classpath of every worker, but I can't seem to find out how).
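
One approach I am considering (a rough, untested sketch; the app name and jar
path below are just placeholders) is to hand the jar to SparkConf up front
instead of calling addJar after the context is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-static-app")                      // placeholder name
  .setJars(Seq("/opt/myapp/myapp-assembly.jar"))    // placeholder path
val sc = new SparkContext(conf)

Is that the recommended way, or is there a cleaner mechanism?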

-- 
Best regards,
Martin Gammelsæter


Re: Disabling SparkContext WebUI on port 4040, accessing information programmatically?

2014-07-08 Thread Martin Gammelsæter
Digging a bit more I see that there is yet another jetty instance that
is causing the problem, namely the BroadcastManager has one. I guess
this one isn't very wise to disable... It might very well be that the
WebUI is a problem as well, but I guess the code doesn't get far
enough. Any ideas on how to solve this? Spark seems to use jetty
8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
of the problem. Any ideas?

On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
 Hi!

 I am building a web frontend for a Spark app, allowing users to input
 sql/hql and get results back. When starting a SparkContext from within
 my server code (using jetty/dropwizard) I get the error

 java.lang.NoSuchMethodError:
 org.eclipse.jetty.server.AbstractConnector: method init()V not found

 when Spark tries to fire up its own jetty server. This does not happen
 when running the same code without my web server. This is probably
 fixable somehow(?) but I'd like to disable the webUI as I don't need
 it, and ideally I would like to access that information
 programmatically instead, allowing me to embed it in my own web
 application.

 Is this possible?

 --
 Best regards,
 Martin Gammelsæter



-- 
Mvh.
Martin Gammelsæter
92209139


Re: Spark SQL user defined functions

2014-07-07 Thread Martin Gammelsæter
Hi again, and thanks for your reply!

On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com wrote:

 Sweet. Any idea about when this will be merged into master?


 It is probably going to be a couple of weeks.  There is a fair amount of
 cleanup that needs to be done.  It works though and we used it in most of
 the demos at the spark summit.  Mostly I just need to add tests and move it
 out of HiveContext (there is no good reason for that code to depend on
 HiveContext). So you could also just try working with that branch.


 This is probably a stupid question, but can you query Spark SQL tables
 from a (local?) hive context? In which case using that could be a
 workaround until the PR is merged.


 Yeah, this is kind of subtle.  In a HiveContext, SQL Tables are just an
 additional catalog that sits on top of the metastore.  All the query
 execution occurs in the same code path, including the use of the Hive
 Function Registry, independent of where the table comes from.  So for your
 use case you can just create a hive context, which will create a local
 metastore automatically if no hive-site.xml is present.

Nice, that sounds like it'll solve my problems. Just for clarity: are
LocalHiveContext and HiveContext equivalent when no hive-site.xml is
present, or are there still differences?
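
For reference, I am assuming the usage would then look roughly like this
(an untested sketch on my side; xpath_string is one of Hive's built-in XML
UDFs and may or may not cover my use case):

import org.apache.spark.sql.hive.HiveContext

// Creates a local metastore automatically when no hive-site.xml is found.
val hiveCtx = new HiveContext(sc)
import hiveCtx.createSchemaRDD  // same implicit RDD-to-SchemaRDD conversion as in SQLContext

// Register the case-class RDD as a table as before, then query with HiveQL:
hiveCtx.hql("SELECT id, xpath_string(xmlfield, '//sometag') FROM cf")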

-- 
Best regards,
Martin Gammelsæter


Re: How to use groupByKey and CqlPagingInputFormat

2014-07-05 Thread Martin Gammelsæter
Ah, I see. Thank you!

As we are in the process of building the system we have not tried with
any large amounts of data yet, but when the time comes I'll try both
implementations and do a small benchmark.

On Fri, Jul 4, 2014 at 9:20 PM, Mohammed Guller moham...@glassbeam.com wrote:
 As far as I know, there is not much difference, except that the outer 
 parentheses are redundant. The problem with your original code was that there 
 was a mismatch between the opening and closing parentheses. Sometimes the error 
 messages are misleading :-)
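 
 A quick illustration (contrived example) of why the two forms are
 interchangeable here:
 
 val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
 
 // Braces let you pass the block (a partial function with case) directly:
 pairs.map { case (k, v) => (k, v + 1) }
 
 // Wrapping it in parentheses as well is legal but adds nothing:
 pairs.map({ case (k, v) => (k, v + 1) })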

 Do you see any performance difference with the Datastax spark driver?

 Mohammed

 -Original Message-
 From: Martin Gammelsæter [mailto:martingammelsae...@gmail.com]
 Sent: Friday, July 4, 2014 12:43 AM
 To: user@spark.apache.org
 Subject: Re: How to use groupByKey and CqlPagingInputFormat

 On Thu, Jul 3, 2014 at 10:29 PM, Mohammed Guller moham...@glassbeam.com 
 wrote:
 Martin,

 1) The first map contains the columns in the primary key, which could be a 
 compound primary key containing multiple columns,  and the second map 
 contains all the non-key columns.

 Ah, thank you, that makes sense.

 2) try this fixed code:
 val navnrevmap = casRdd.map {
   case (key, value) =>
     (ByteBufferUtil.string(value.get("navn")),
      ByteBufferUtil.toInt(value.get("revisjon")))
 }.groupByKey()

 I changed from CqlPagingInputFormat to the new Datastax cassandra-spark 
 driver, which is a bit easier to work with, but thanks! I'm curious though, 
 what is the semantic difference between
 map({}) and map{}?



-- 
Mvh.
Martin Gammelsæter
92209139


Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Hi!

I have a Spark cluster running on top of a Cassandra cluster, using
Datastax' new driver, and one of the fields of my RDDs is an
XML-string. In a normal Scala Spark job, parsing that data is no
problem, but I would like to also make that information available
through Spark SQL. So, is there any way to write user defined
functions for Spark SQL? I know that a HiveContext is available, but I
understand that that is for querying data from Hive, and I don't have
Hive in my stack (please correct me if I'm wrong).

I would love to be able to do something like the following:

val casRdd = sparkCtx.cassandraTable("ks", "cf")

// registerAsTable etc

val res = sql("SELECT id, xmlGetTag(xmlfield, 'sometag') FROM cf")
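
(In the meantime I could of course extract the tag while building the RDD,
before registering it as a table. A rough, untested sketch, where dokRdd
stands for the RDD of case-class rows and the field/tag names are just
illustrative:)

import scala.xml.XML

case class Parsed(id: String, sometag: String)

val parsed = dokRdd.map { d =>
  // Pull out the text of every <sometag> element in the XML field.
  Parsed(d.id, (XML.loadString(d.xmlfield) \\ "sometag").text)
}
// parsed can then itself be registered as a table and queried with plain SQL,
// but a real UDF would of course be much nicer.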

-- 
Best regards,
Martin Gammelsæter


Re: Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Takuya, thanks for your reply :)
I am already doing that, and it is working well. My question is, can I
define arbitrary functions to be used in these queries?

On Fri, Jul 4, 2014 at 11:12 AM, Takuya UESHIN ues...@happy-camper.st wrote:
 Hi,

 You can convert a standard RDD of a Product class (e.g. a case class) to a
 SchemaRDD via SQLContext.
 Load the data from Cassandra into an RDD of your case class, convert it to a
 SchemaRDD, register it,
 and then you can use it in your SQL queries.

 http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-on-rdds
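 
 A minimal sketch of that flow (the case class, toy data and table name below
 are only placeholders for your own):
 
 import org.apache.spark.sql.SQLContext
 
 case class Dokument(id: String, navn: String)
 
 val sqlContext = new SQLContext(sc)
 import sqlContext.createSchemaRDD  // implicit conversion RDD[Dokument] -> SchemaRDD
 
 // Toy data; in your case this would be the RDD built from Cassandra.
 val docs = sc.parallelize(Seq(Dokument("1", "a"), Dokument("2", "b")))
 docs.registerAsTable("dokumenter")
 sqlContext.sql("SELECT navn FROM dokumenter WHERE id = '1'").collect()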

 Thanks.



 2014-07-04 17:59 GMT+09:00 Martin Gammelsæter
 martingammelsae...@gmail.com:

 Hi!

 I have a Spark cluster running on top of a Cassandra cluster, using
 Datastax' new driver, and one of the fields of my RDDs is an
 XML-string. In a normal Scala Spark job, parsing that data is no
 problem, but I would like to also make that information available
 through Spark SQL. So, is there any way to write user defined
 functions for Spark SQL? I know that a HiveContext is available, but I
 understand that that is for querying data from Hive, and I don't have
 Hive in my stack (please correct me if I'm wrong).

 I would love to be able to do something like the following:

 val casRdd = sparkCtx.cassandraTable("ks", "cf")

 // registerAsTable etc

 val res = sql("SELECT id, xmlGetTag(xmlfield, 'sometag') FROM cf")

 --
 Best regards,
 Martin Gammelsæter




 --
 Takuya UESHIN
 Tokyo, Japan

 http://twitter.com/ueshin



-- 
Mvh.
Martin Gammelsæter
92209139


How to use groupByKey and CqlPagingInputFormat

2014-07-02 Thread Martin Gammelsæter
Hi!

Total Scala and Spark noob here with a few questions.

I am trying to modify a few of the examples in the spark repo to fit
my needs, but running into a few problems.

I am making an RDD from Cassandra, which I've finally gotten to work,
and trying to do some operations on it. Specifically I am trying to do
a grouping by key for future calculations.
I want the key to be the column "navn" from a certain column family,
but I don't think I understand the returned types. Why are two Maps
returned instead of one? I would have thought you'd get a list of some
kind containing every row, where every element in the list was a map
from column name to value. So my first question is: what do these maps
represent?

    val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
      classOf[CqlPagingInputFormat],
      classOf[java.util.Map[String,ByteBuffer]],
      classOf[java.util.Map[String,ByteBuffer]])

    val navnrevmap = casRdd.map({
      case (key, value) =>
        (ByteBufferUtil.string(value.get("navn")),
         ByteBufferUtil.toInt(value.get("revisjon"))
    }).groupByKey()

The second question (probably stemming from my not understanding the
first question) is why am I not allowed to do a groupByKey in the
above code? I understand that the type does not have that function,
but I'm unclear on what I have to do to make it work.

-- 
Best regards,
Martin Gammelsæter


How to use .newAPIHadoopRDD() from Java (w/ Cassandra)

2014-06-27 Thread Martin Gammelsæter
Hello!

I have just started trying out Spark to see if it fits my needs, but I
am running into some issues when trying to port the
CassandraCQLTest.scala example into Java. The specific errors etc.
that I encounter can be seen here:

http://stackoverflow.com/questions/24450540/how-to-use-sparks-newapihadooprdd-java-equivalent-of-scalas-classof

where I have also asked the same question. Any pointers on how to use
.newAPIHadoopRDD() and CqlPagingInputFormat from Java are greatly
appreciated! (Either here or on Stack Overflow.)

-- 
Best regards,
Martin Gammelsæter