Hi Jörn,
We will be upgrading to MapR 5.1, Hive 1.2, and Spark 1.6.1 at the end of
June.
In the meantime, can this still be done with the current versions?
A firewall should not be the issue, since the edge nodes and cluster nodes
are hosted in the same location and share the same NFS mount.
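One way this might work without pulling every row back to the edge node is to let the executors write the result themselves instead of calling collect(). The sketch below is only an illustration under assumptions: save_query_as_text, format_row, and the output path are hypothetical names, and it relies on the query result exposing map() and saveAsTextFile(), which a SchemaRDD in Spark 1.2.1 should do as an RDD subclass. It has not been verified on this cluster.

```python
def format_row(row):
    """Turn one result row (a tuple-like Row) into a tab-delimited line."""
    return "\t".join("%s" % (value,) for value in row)


def save_query_as_text(sql_context, query, output_path):
    """Run a HiveQL query and write the rows out from the executors.

    sql_context is assumed to be a HiveContext; output_path is a
    placeholder for a directory on the cluster file system that does
    not exist yet. Nothing is collected on the driver.
    """
    result = sql_context.sql(query)       # lazy: no job has run yet
    lines = result.map(format_row)        # one text line per result row
    lines.saveAsTextFile(output_path)     # executors write part files
```

With a live context this would be called as save_query_as_text(sqlContext, "select column from database.table", "/path/to/output"), where the path is a placeholder. The rows land as part files in that directory rather than in the Python session.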
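To rule out connectivity more directly, a plain TCP check could be run from the edge node against the host and port named in the "Failed to connect to" message. This is a minimal sketch using only the Python standard library; can_connect is a hypothetical helper, and the host and port have to be taken from the actual error, since they are elided here.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, unresolvable, or timed out.
        return False
    conn.close()
    return True
```

A False result for the address in the stack trace would point at the network path between the edge node and that cluster node, not at the query itself. A True result only shows the port is reachable at that moment.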
On Thu, May 26, 2016 at 1:34 AM, Jörn Franke wrote:
> Both are outdated versions; people can usually support you better if you
> upgrade to the newest ones.
> Firewall could be an issue here.
>
>
> On 26 May 2016, at 10:11, Nikolay Voronchikhin wrote:
>
> Hi PySpark users,
>
> We need to be able to run large Hive queries in PySpark 1.2.1. Users run
> PySpark on an edge node and submit jobs to a cluster that allocates YARN
> resources to the clients.
> We are using MapR as the Hadoop distribution, with Hive 0.13 and Spark
> 1.2.1.
>
>
> Currently, our process for writing queries works only for small result
> sets, for example:
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> results = sqlContext.sql("select column from database.table limit 10").collect()
> results
>
>
>
> How do I save the HiveQL query result to an RDD first, and then output the
> results?
>
> This is the error I get when running a query that requires output of
> 400,000 rows:
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> results = sqlContext.sql("select column from database.table").collect()
> results
> ...
>
> /path/to/mapr/spark/spark-1.2.1/python/pyspark/sql.py in collect(self)
>    1976         """
>    1977         with SCCallSiteSync(self.context) as css:
> -> 1978             bytesInJava = self._jschema_rdd.baseSchemaRDD().collectToPython().iterator()
>    1979         cls = _create_cls(self.schema())
>    1980         return map(cls, self._collect_iterator_through_file(bytesInJava))
>
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o76.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result: java.io.IOException: Failed to connect
> to cluster_node/IP_address:port
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
> Ideally, this query should return the full 400,000-row result set.
>
>
> Thanks for your help,
> Niko