Re: Providing query DSL to Elasticsearch for Spark (2.1.0.Beta3)
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4.

On Dec 3, 2014, at 3:31 PM, Ian Wilkinson <ia...@me.com> wrote:

[...]

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Providing query DSL to Elasticsearch for Spark (2.1.0.Beta3)
Hi,

I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). In the following I provide the query (as query DSL):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object TryES {
  val sparkConf = new SparkConf().setAppName("Campaigns")
  sparkConf.set("es.nodes", "es_cluster:9200")
  sparkConf.set("es.nodes.discovery", "false")
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) {
    val query = """{ "query": { ... } }"""
    val campaigns = sc.esRDD(resource, query)
    campaigns.count()
  }
}

However, when I submit this (using spark-1.1.0-bin-hadoop2.4), I am seeing the following exceptions:

14/12/03 14:55:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/12/03 14:55:27 INFO scheduler.DAGScheduler: Failed to run count at TryES.scala:...
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot open stream for resource { query: { ... } }

Is the query DSL supported with esRDD, or am I missing something more fundamental?

Huge thanks,
ian

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
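[Editor's note] The "Cannot open stream for resource ..." error indicates the connector is treating the supplied string as a resource path rather than a query. One way to keep the two unambiguous is to pass the DSL through the connector's "es.query" setting alongside "es.resource". A minimal sketch of the settings shape (shown in Python as a plain dict so it can be checked without a live cluster; the index/type name is hypothetical):

```python
import json

# Sketch, not the connector itself: elasticsearch-hadoop reads its settings
# as plain string properties. Keeping the query under "es.query" and the
# index/type under "es.resource" avoids any ambiguity between the two.
def es_settings(resource, query_dsl):
    json.loads(query_dsl)  # fail fast if the DSL is not valid JSON
    return {
        "es.nodes": "es_cluster:9200",
        "es.nodes.discovery": "false",
        "es.resource": resource,       # hypothetical index/type
        "es.query": query_dsl,
    }

settings = es_settings(
    "campaigns/docs",
    '{ "query": { "match_all": {} } }',
)
print(settings["es.resource"])  # -> campaigns/docs
```

A malformed DSL string then fails at construction time with a JSON error, rather than surfacing later as an opaque "cannot open stream" failure inside a Spark task.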
Re: DynamoDB input source
)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:132)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:767)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/07/20 23:56:05 INFO Remoting: Remoting shut down
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

On 4 Jul 2014, at 18:04, Nick Pentreath <nick.pentre...@gmail.com> wrote:

Interesting - I would have thought they would make that available publicly. Unfortunately, unless you can use Spark on EMR, I guess your options are to hack it by spinning up an EMR cluster and getting the JAR, or maybe fall back to using boto and rolling your own :(

On Fri, Jul 4, 2014 at 9:28 AM, Ian Wilkinson <ia...@me.com> wrote:

Trying to discover the source for DynamoDBInputFormat. It does not appear in:
- https://github.com/aws/aws-sdk-java
- https://github.com/apache/hive

Then I came across http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb. Unsure whether this represents the latest situation…

ian

On 4 Jul 2014, at 16:58, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection, but this involves figuring out how to split the data in dynamo - the inputFormat takes care of this, so it should be the easier approach.

— Sent from Mailbox

On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson <ia...@me.com> wrote:

Excellent. Let me get browsing on this.

Huge thanks,
ian

On 4 Jul 2014, at 16:47, Nick Pentreath <nick.pentre...@gmail.com> wrote:

No boto support for that. In master there is Python support for loading a Hadoop inputFormat. Not sure if it will be in 1.0.1 or 1.1. In the master docs, under the programming guide, there are instructions, and under the examples project there are pyspark examples of using Cassandra and HBase. These should hopefully give you enough to get started.
Depending on how easy it is to use the dynamo DB format, you may have to write a custom converter (see the mentioned examples for more details).

Sent from my iPhone

On 4 Jul 2014, at 08:38, Ian Wilkinson <ia...@me.com> wrote:

Hi Nick,

I'm going to be working with python primarily. Are you aware of comparable boto support?

ian

On 4 Jul 2014, at 16:32, Nick Pentreath <nick.pentre...@gmail.com> wrote:

You should be able to use DynamoDBInputFormat (I think this should be part of the AWS libraries for Java) and create a HadoopRDD from that.

On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson <ia...@me.com> wrote:

Hi,

I noticed mention of DynamoDB as an input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention of this support. Any pointers would be well received.

Big thanks,
ian
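[Editor's note] The splitting problem Nick mentions is what DynamoDB's parallel Scan addresses: the Scan API accepts TotalSegments/Segment parameters, so a roll-your-own boto reader can hand one segment to each Spark partition (the part DynamoDBInputFormat would otherwise handle). A hedged sketch of that control flow, with the actual boto Scan call stubbed out since it needs live credentials, and a hypothetical table name:

```python
# Sketch of the "roll your own with boto" route: build one Scan parameter
# set per Spark partition via TotalSegments/Segment, and page through each
# segment with LastEvaluatedKey. scan_fn stands in for a real boto Scan call.
def segment_params(total_segments):
    # One parameter dict per Spark partition / worker.
    return [
        {"TableName": "campaigns",  # hypothetical table
         "TotalSegments": total_segments,
         "Segment": i}
        for i in range(total_segments)
    ]

def scan_segment(params, scan_fn):
    # Page through one segment until DynamoDB stops returning a
    # LastEvaluatedKey, accumulating the items.
    items = []
    while True:
        page = scan_fn(params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        params = dict(params, ExclusiveStartKey=last_key)

# Fake single-page responses, just to show the control flow:
fake = lambda p: {"Items": [{"segment": p["Segment"]}]}
print([scan_segment(p, fake) for p in segment_params(2)])
# -> [[{'segment': 0}], [{'segment': 1}]]
```

In PySpark this would typically be driven via something like `sc.parallelize(segment_params(n)).flatMap(...)`, with each worker scanning its own segment.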
Problem running Spark shell (1.0.0) on EMR
Hi,

I’m trying to run the Spark (1.0.0) shell on EMR and encountering a classpath issue. I suspect I’m missing something gloriously obvious, but so far it is eluding me.

I launch the EMR cluster (using the aws cli) with:

aws emr create-cluster --name "Test Cluster" \
  --ami-version 3.0.3 \
  --no-auto-terminate \
  --ec2-attributes KeyName=... \
  --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --region eu-west-1

then,

$ aws emr ssh --cluster-id ... --key-pair-file ... --region eu-west-1

On the master node, I then launch the shell with:

[hadoop@ip-... spark]$ ./bin/spark-shell

and try performing:

scala> val logs = sc.textFile("s3n://.../")

This produces:

14/07/16 12:40:35 WARN storage.BlockManager: Putting block broadcast_0 failed
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;

Any help mighty welcome,
ian
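[Editor's note] This particular NoSuchMethodError is the classic Guava version clash: HashFunction.hashInt(int) only exists from Guava 12 onward, while Hadoop distributions of this era bundle guava-11.0.2, which can shadow the newer Guava that Spark 1.0 ships. A small sketch of how one might triage the classpath, given a list of jar names gathered from the cluster (the jar names below are illustrative):

```python
import re

# Sketch: flag Guava jars whose major version predates HashFunction.hashInt
# (added in Guava 12). A guava-11.x jar earlier on the classpath - Hadoop
# bundles 11.0.2 - produces exactly the NoSuchMethodError above.
def stale_guava(jar_names, min_major=12):
    stale = []
    for name in jar_names:
        m = re.search(r"guava-(\d+)(?:\.[\d.]+)?\.jar$", name)
        if m and int(m.group(1)) < min_major:
            stale.append(name)
    return stale

jars = ["guava-11.0.2.jar", "guava-14.0.1.jar", "spark-core_2.10-1.0.0.jar"]
print(stale_guava(jars))  # -> ['guava-11.0.2.jar']
```

On the cluster itself, something like `find / -name 'guava-*.jar' 2>/dev/null` would produce the input list; the usual remedies are to reorder the classpath so Spark's Guava wins, or to shade/exclude the stale jar.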
Re: DynamoDB input source
Hi Nick,

I’m going to be working with python primarily. Are you aware of comparable boto support?

ian

On 4 Jul 2014, at 16:32, Nick Pentreath <nick.pentre...@gmail.com> wrote:

You should be able to use DynamoDBInputFormat (I think this should be part of the AWS libraries for Java) and create a HadoopRDD from that.

On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson <ia...@me.com> wrote:

Hi,

I noticed mention of DynamoDB as an input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention of this support. Any pointers would be well received.

Big thanks,
ian
Re: DynamoDB input source
Trying to discover the source for DynamoDBInputFormat. It does not appear in:
- https://github.com/aws/aws-sdk-java
- https://github.com/apache/hive

Then I came across http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb. Unsure whether this represents the latest situation…

ian

On 4 Jul 2014, at 16:58, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection, but this involves figuring out how to split the data in dynamo - the inputFormat takes care of this, so it should be the easier approach.

— Sent from Mailbox

On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson <ia...@me.com> wrote:

Excellent. Let me get browsing on this.

Huge thanks,
ian

On 4 Jul 2014, at 16:47, Nick Pentreath <nick.pentre...@gmail.com> wrote:

No boto support for that. In master there is Python support for loading a Hadoop inputFormat. Not sure if it will be in 1.0.1 or 1.1. In the master docs, under the programming guide, there are instructions, and under the examples project there are pyspark examples of using Cassandra and HBase. These should hopefully give you enough to get started. Depending on how easy it is to use the dynamo DB format, you may have to write a custom converter (see the mentioned examples for more details).

Sent from my iPhone

On 4 Jul 2014, at 08:38, Ian Wilkinson <ia...@me.com> wrote:

Hi Nick,

I’m going to be working with python primarily. Are you aware of comparable boto support?

ian

On 4 Jul 2014, at 16:32, Nick Pentreath <nick.pentre...@gmail.com> wrote:

You should be able to use DynamoDBInputFormat (I think this should be part of the AWS libraries for Java) and create a HadoopRDD from that.

On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson <ia...@me.com> wrote:

Hi,

I noticed mention of DynamoDB as an input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention of this support. Any pointers would be well received.

Big thanks,
ian