Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY
Aha, that makes sense. Thanks for the response! I guess error messages are one of the areas where Spark could use some love (:

On Fri, Jul 18, 2014 at 9:41 PM, Michael Armbrust mich...@databricks.com wrote:

Sorry for the non-obvious error message. It is not valid SQL to include attributes in the select clause unless they are also in the group by clause or are inside of an aggregate function.

On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote:

Hi again! I am having problems when using GROUP BY on both SQLContext and HiveContext (same problem). My code (simplified as much as possible) can be seen here: http://pastebin.com/33rjW67H

In short, I'm getting data from a Cassandra store with Datastax's new driver (which works great, by the way; recommended!) and mapping it to a Spark SQL table through a Product class (Dokument in the source). Regular SELECTs and the like work fine, but once I try to do a GROUP BY, I get the following error:

Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:25 failed 4 times, most recent failure: Exception failure in TID 63 on host 192.168.121.132: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0
  org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)
  org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64)
  org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:195)
  org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:174)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  scala.collection.AbstractIterator.to(Iterator.scala:1157)
  scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
  org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
  org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)
  org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

What am I doing wrong?

--
Best regards,
Martin Gammelsæter

--
Best regards,
Martin Gammelsæter
92209139
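To make the rule Michael describes concrete, here is a tiny plain-Scala sketch (not Spark SQL's actual analyzer; the function and its arguments are made up for illustration): a column in the SELECT clause is valid only if it either appears in GROUP BY or sits inside an aggregate function.

```scala
// Hypothetical illustration of the GROUP BY rule, not Spark's real analysis.
// A selected column is valid if it is grouped on or aggregated over.
def selectIsValid(selected: Set[String],
                  groupedBy: Set[String],
                  aggregated: Set[String]): Boolean =
  (selected -- aggregated).subsetOf(groupedBy)

// SELECT id, count(x) FROM t GROUP BY name  -- invalid: id is neither grouped nor aggregated
val bad = selectIsValid(Set("id", "x"), Set("name"), Set("x"))

// SELECT name, count(x) FROM t GROUP BY name  -- valid
val good = selectIsValid(Set("name", "x"), Set("name"), Set("x"))
```

In the original query, id#0 was selected but neither grouped on nor aggregated, which is why the AttributeReference could not be evaluated inside the Aggregate operator.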
TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY
Hi again! I am having problems when using GROUP BY on both SQLContext and HiveContext (same problem). My code (simplified as much as possible) can be seen here: http://pastebin.com/33rjW67H

In short, I'm getting data from a Cassandra store with Datastax's new driver (which works great, by the way; recommended!) and mapping it to a Spark SQL table through a Product class (Dokument in the source). Regular SELECTs and the like work fine, but once I try to do a GROUP BY, I get the following error:

Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:25 failed 4 times, most recent failure: Exception failure in TID 63 on host 192.168.121.132: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0
  org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)
  org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64)
  org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:195)
  org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.next(Aggregate.scala:174)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  scala.collection.AbstractIterator.to(Iterator.scala:1157)
  scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
  org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:750)
  org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)
  org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

What am I doing wrong?

--
Best regards,
Martin Gammelsæter
Re: Supported SQL syntax in Spark SQL
I am very interested in the original question as well. Is there any list (even if it is simply in the code) of all supported syntax for Spark SQL?

On Mon, Jul 14, 2014 at 6:41 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Are you sure the code running on the cluster has been updated?

I launched the cluster using spark-ec2 from the 1.0.1 release, so I'm assuming that's taken care of, at least in theory. I just spun down the clusters I had up, but I will revisit this tomorrow and provide the information you requested. Nick

--
Best regards,
Martin Gammelsæter
92209139
Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?
Thanks for your input, Koert and DB. Rebuilding with Jetty 9.x didn't seem to work. For now we've downgraded Dropwizard to 0.6.2, which uses a compatible version of Jetty. Not optimal, but it works for now.

On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote:

We're doing a similar thing to launch Spark jobs in Tomcat, and I opened a JIRA for this. There are a couple of technical discussions there: https://issues.apache.org/jira/browse/SPARK-2100

In the end, we realized that Spark uses Jetty not only for the Spark WebUI but also for distributing the jars and tasks, so it is really hard to remove the web dependency from Spark. We ended up launching our Spark job in yarn-cluster mode, so at runtime the only dependency in our web application is spark-yarn, which doesn't contain any of Spark's web stuff.

PS: upgrading Spark's Jetty from 8.x to 9.x may not be as straightforward as just changing the version in the Spark build script. Jetty 9.x requires Java 7, since the servlet API it uses (servlet 3.1) requires Java 7.

Sincerely, DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers ko...@tresata.com wrote:

Do you control your cluster and Spark deployment? If so, you can try to rebuild with Jetty 9.x.

On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote:

Digging a bit more, I see that there is yet another Jetty instance causing the problem, namely the one in BroadcastManager. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how to solve this? Spark seems to use Jetty 8.1.14, while Dropwizard uses Jetty 9.0.7, so that might be the source of the problem. Any ideas?

On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter martingammelsae...@gmail.com wrote:

Hi! I am building a web frontend for a Spark app, allowing users to input SQL/HQL and get results back. When starting a SparkContext from within my server code (using Jetty/Dropwizard) I get the error

java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method init()V not found

when Spark tries to fire up its own Jetty server. This does not happen when running the same code without my web server. This is probably fixable somehow(?), but I'd like to disable the WebUI as I don't need it, and ideally I would like to access that information programmatically instead, allowing me to embed it in my own web application. Is this possible?

--
Best regards,
Martin Gammelsæter

--
Best regards,
Martin Gammelsæter
92209139

--
Best regards,
Martin Gammelsæter
92209139
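For reference, the workaround Martin settled on (staying on a Dropwizard release whose Jetty is compatible with Spark's Jetty 8.x) can be pinned in the build. A hypothetical build.sbt fragment; the artifact coordinates are assumptions from that era and should be checked against Maven Central:

```scala
// build.sbt sketch (coordinates are assumptions, not verified).
// Spark 1.0.x ships Jetty 8.x, so stay on a Dropwizard release that is
// compatible with Jetty 8 rather than pulling in Jetty 9.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.1",
  "com.yammer.dropwizard" % "dropwizard-core" % "0.6.2"
)
```

The alternative discussed in the thread, rebuilding Spark itself against Jetty 9.x, additionally requires Java 7 as DB points out.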
Initial job has not accepted any resources means many things
It seems like "Initial job has not accepted any resources" shows up for a wide variety of different errors: for example the obvious one, where you've requested more memory than is available, but also, for example, the case where the worker nodes do not have the appropriate code on their classpath. Debugging from this error is very hard, as the underlying errors do not show up in the logs on the workers. Is this a known issue?

I'm having issues with getting the code to the workers without using addJar (my code is a fairly static application, so I'd like to avoid using addJar every time the app starts up and instead manually add the jar to the classpath of every worker, but I can't seem to find out how).

--
Best regards,
Martin Gammelsæter
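One possible alternative to addJar, assuming the application jar is already installed at the same path on every worker: put it on the executor classpath through configuration. A sketch of a spark-defaults.conf fragment; the jar path is hypothetical, and this assumes a Spark version where the property exists:

```
# conf/spark-defaults.conf -- sketch; the path is an assumption, and the jar
# must already be present at this location on every worker node.
spark.executor.extraClassPath  /opt/myapp/lib/myapp.jar
```

With this, the driver no longer has to ship the jar on each startup, which matches the "fairly static application" deployment described above.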
Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?
Digging a bit more, I see that there is yet another Jetty instance causing the problem, namely the one in BroadcastManager. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how to solve this? Spark seems to use Jetty 8.1.14, while Dropwizard uses Jetty 9.0.7, so that might be the source of the problem. Any ideas?

On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter martingammelsae...@gmail.com wrote:

Hi! I am building a web frontend for a Spark app, allowing users to input SQL/HQL and get results back. When starting a SparkContext from within my server code (using Jetty/Dropwizard) I get the error

java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method init()V not found

when Spark tries to fire up its own Jetty server. This does not happen when running the same code without my web server. This is probably fixable somehow(?), but I'd like to disable the WebUI as I don't need it, and ideally I would like to access that information programmatically instead, allowing me to embed it in my own web application. Is this possible?

--
Best regards,
Martin Gammelsæter

--
Best regards,
Martin Gammelsæter
92209139
Re: Spark SQL user defined functions
Hi again, and thanks for your reply!

On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com wrote:

Sweet. Any idea about when this will be merged into master?

It is probably going to be a couple of weeks. There is a fair amount of cleanup that needs to be done. It works, though, and we used it in most of the demos at the Spark Summit. Mostly I just need to add tests and move it out of HiveContext (there is no good reason for that code to depend on HiveContext). So you could also just try working with that branch.

This is probably a stupid question, but can you query Spark SQL tables from a (local?) hive context? In which case using that could be a workaround until the PR is merged.

Yeah, this is kind of subtle. In a HiveContext, SQL tables are just an additional catalog that sits on top of the metastore. All the query execution occurs in the same code path, including the use of the Hive function registry, independent of where the table comes from. So for your use case you can just create a HiveContext, which will create a local metastore automatically if no hive-site.xml is present.

Nice, that sounds like it'll solve my problems. Just for clarity: are LocalHiveContext and HiveContext equivalent if no hive-site.xml is present, or are there still differences?

--
Best regards,
Martin Gammelsæter
Re: How to use groupByKey and CqlPagingInputFormat
Ah, I see. Thank you! As we are in the process of building the system we have not tried with any large amounts of data yet, but when the time comes I'll try both implementations and do a small benchmark.

On Fri, Jul 4, 2014 at 9:20 PM, Mohammed Guller moham...@glassbeam.com wrote:

As far as I know, there is not much difference, except that the outer parentheses are redundant. The problem with your original code was that there was a mismatch in the opening and closing parentheses. Sometimes the error messages are misleading :-)

Do you see any performance difference with the Datastax Spark driver?

Mohammed

-----Original Message-----
From: Martin Gammelsæter [mailto:martingammelsae...@gmail.com]
Sent: Friday, July 4, 2014 12:43 AM
To: user@spark.apache.org
Subject: Re: How to use groupByKey and CqlPagingInputFormat

On Thu, Jul 3, 2014 at 10:29 PM, Mohammed Guller moham...@glassbeam.com wrote:

Martin,

1) The first map contains the columns in the primary key, which could be a compound primary key containing multiple columns, and the second map contains all the non-key columns.

Ah, thank you, that makes sense.

2) Try this fixed code:

val navnrevmap = casRdd.map {
  case (key, value) =>
    (ByteBufferUtil.string(value.get("navn")),
     ByteBufferUtil.toInt(value.get("revisjon")))
}.groupByKey()

I changed from CqlPagingInputFormat to the new Datastax cassandra-spark driver, which is a bit easier to work with, but thanks! I'm curious though: what is the semantic difference between map({}) and map{}?

--
Best regards,
Martin Gammelsæter
92209139
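On the map({...}) versus map {...} question: as Mohammed notes, the outer parentheses are simply redundant. When a method takes a single function argument, Scala lets you pass a block without wrapping it in parentheses, and the two forms compile to the same call. A minimal plain-Scala illustration:

```scala
// The block form just drops the redundant parentheses around the single
// function argument; both calls are identical.
val withParens    = List(1, 2, 3).map({ x => x * 2 })
val withoutParens = List(1, 2, 3).map { x => x * 2 }
// both are List(2, 4, 6)
```

The same applies to the case-pattern form used in the thread: map { case (k, v) => ... } and map({ case (k, v) => ... }) mean the same thing.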
Spark SQL user defined functions
Hi! I have a Spark cluster running on top of a Cassandra cluster, using Datastax's new driver, and one of the fields of my RDDs is an XML string. In a normal Scala Spark job, parsing that data is no problem, but I would like to also make that information available through Spark SQL. So, is there any way to write user-defined functions for Spark SQL? I know that a HiveContext is available, but I understand that that is for querying data from Hive, and I don't have Hive in my stack (please correct me if I'm wrong). I would love to be able to do something like the following:

val casRdd = sparkCtx.cassandraTable("ks", "cf")
// registerAsTable etc
val res = sql("SELECT id, xmlGetTag(xmlfield, 'sometag') FROM cf")

--
Best regards,
Martin Gammelsæter
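Until SQL-level UDF support is available, the wished-for xmlGetTag can at least be prototyped on the Scala side and applied in a plain map step before registering the table. A naive sketch using a regex rather than a real XML parser; the function name and behaviour mirror the one wished for in the post and are not an existing Spark API:

```scala
// Hypothetical helper mirroring the wished-for xmlGetTag(field, tag).
// Naive regex extraction, not a real XML parser: returns the text of the
// first <tag>...</tag> occurrence, or "" if the tag is absent.
def xmlGetTag(xml: String, tag: String): String = {
  val pattern = s"(?s)<$tag>(.*?)</$tag>".r
  pattern.findFirstMatchIn(xml).map(_.group(1)).getOrElse("")
}
```

Applied before registerAsTable, the extracted value becomes an ordinary column that plain SELECTs can query, sidestepping the need for a SQL UDF.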
Re: Spark SQL user defined functions
Takuya, thanks for your reply :) I am already doing that, and it is working well. My question is: can I define arbitrary functions to be used in these queries?

On Fri, Jul 4, 2014 at 11:12 AM, Takuya UESHIN ues...@happy-camper.st wrote:

Hi,

You can convert a standard RDD of a Product class (e.g. a case class) to a SchemaRDD via SQLContext. Load the data from Cassandra into an RDD of your case class, convert it to a SchemaRDD and register it, then you can use it in your SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-on-rdds

Thanks.

2014-07-04 17:59 GMT+09:00 Martin Gammelsæter martingammelsae...@gmail.com:

Hi! I have a Spark cluster running on top of a Cassandra cluster, using Datastax's new driver, and one of the fields of my RDDs is an XML string. In a normal Scala Spark job, parsing that data is no problem, but I would like to also make that information available through Spark SQL. So, is there any way to write user-defined functions for Spark SQL? I know that a HiveContext is available, but I understand that that is for querying data from Hive, and I don't have Hive in my stack (please correct me if I'm wrong). I would love to be able to do something like the following:

val casRdd = sparkCtx.cassandraTable("ks", "cf")
// registerAsTable etc
val res = sql("SELECT id, xmlGetTag(xmlfield, 'sometag') FROM cf")

--
Best regards,
Martin Gammelsæter

--
Takuya UESHIN
Tokyo, Japan
http://twitter.com/ueshin

--
Best regards,
Martin Gammelsæter
92209139
How to use groupByKey and CqlPagingInputFormat
Hi! Total Scala and Spark noob here with a few questions. I am trying to modify a few of the examples in the Spark repo to fit my needs, but am running into a few problems. I am making an RDD from Cassandra, which I've finally gotten to work, and am trying to do some operations on it. Specifically, I am trying to do a grouping by key for future calculations. I want the key to be the column navn from a certain column family, but I don't think I understand the returned types. Why are two Maps returned, instead of one? I'd think that you'd get a list of some kind with every row, where every element in the list was a map from column name to value. So my first question is: what do these maps represent?

val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration(),
  classOf[CqlPagingInputFormat],
  classOf[java.util.Map[String, ByteBuffer]],
  classOf[java.util.Map[String, ByteBuffer]])

val navnrevmap = casRdd.map({
  case (key, value) =>
    (ByteBufferUtil.string(value.get("navn")),
     ByteBufferUtil.toInt(value.get("revisjon"))
}).groupByKey()

The second question (probably stemming from my not understanding the first question) is: why am I not allowed to do a groupByKey in the above code? I understand that the type does not have that function, but I'm unclear on what I have to do to make it work.

--
Best regards,
Martin Gammelsæter
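For intuition about what the groupByKey step is meant to produce, the same operation can be mimicked on plain Scala collections (the sample (navn, revisjon) pairs below are made up):

```scala
// Plain-Scala analogue of RDD.groupByKey on made-up (navn, revisjon) pairs:
// collect all values that share a key into one group per key.
val rows = Seq(("doc1", 1), ("doc1", 2), ("doc2", 3))

val grouped: Map[String, Seq[Int]] =
  rows.groupBy(_._1).map { case (navn, pairs) => navn -> pairs.map(_._2) }
// grouped("doc1") contains 1 and 2; grouped("doc2") contains 3
```

On an RDD of pairs this is what groupByKey does in a distributed fashion, which is why it is only available once the RDD's element type is a (key, value) tuple.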
How to use .newAPIHadoopRDD() from Java (w/ Cassandra)
Hello! I have just started trying out Spark to see if it fits my needs, but I am running into some issues when trying to port the CassandraCQLTest.scala example to Java. The specific errors etc. that I encounter can be seen here: http://stackoverflow.com/questions/24450540/how-to-use-sparks-newapihadooprdd-java-equivalent-of-scalas-classof where I have also asked the same question. Any pointers on how to use .newAPIHadoopRDD() and CqlPagingInputFormat from Java are greatly appreciated! (Either here or on Stack Overflow.)

--
Best regards,
Martin Gammelsæter