RE: Hive custom transform scripts in Spark?

Yang Wu (Tata Consultancy Services) Tue, 20 Oct 2015 10:37:03 -0700

Yes.
We are trying to run a custom script written in C# using TRANSFORM, but cannot 
get it to work.
The query and error are below. Any suggestions? Thank you!


Spark version: 1.3
Here is how we add and invoke the script:

scala> hiveContext.sql("""ADD FILE wasb://… /NSSGraphHelper.exe""")
                …
scala> hiveContext.sql("""SELECT TRANSFORM (dc, attribute, key, time, value) 
USING 'NSSGraphHelper.exe' FROM SourceTable""").collect()

The query throws an exception saying the system cannot find the file specified:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 1273, workernode1.nsssparkcluster.g10.internal.cloudapp.net): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
        at org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:61)
        at org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessImpl.create(Native Method)
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:385)
        at java.lang.ProcessImpl.start(ProcessImpl.java:136)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
        ... 16 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
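
Reading the trace, the program that fails to start is "/bin/bash" itself rather than NSSGraphHelper.exe, and CreateProcess is the Windows process-creation call, so the worker appears to be a Windows node with no /bin/bash. Below is a minimal sketch of a check that could be run on a worker node to confirm this, assuming Python is available there. The helper names are hypothetical and are not part of Spark.

```python
import os


def find_executable(path):
    """Return path if it names an existing executable file, else None."""
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None


def describe(path="/bin/bash"):
    """Report whether the launcher named in the stack trace exists on this node."""
    found = find_executable(path)
    if found is None:
        return "%s not found on this node (os.name=%s)" % (path, os.name)
    return "%s is present" % found


if __name__ == "__main__":
    print(describe())
```

If the check reports that /bin/bash is missing, the "file not found" in the exception would refer to the shell and not to the script added with ADD FILE.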

From: Michael Armbrust [mailto:[email protected]]
Sent: Tuesday, October 20, 2015 10:21 AM
To: Yang Wu (Tata Consultancy Services) <[email protected]>
Cc: user <[email protected]>
Subject: Re: Hive custom transform scripts in Spark?

We support TRANSFORM.  Are you having a problem using it?

On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack <[email protected]> wrote:
How can we reuse Hive custom transform scripts written in Python or C++ in Spark?

These scripts read data from stdin and print to stdout.
They use the TRANSFORM syntax in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Example in Hive:
SELECT TRANSFORM(stuff)
USING 'script.exe'
AS thing1, thing2
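
For context, here is a minimal sketch of what such a script could look like in Python, assuming Hive's default TRANSFORM row format (one row per line, columns separated by tabs). The function names and the logic that produces thing1 and thing2 are hypothetical, chosen only to illustrate the stdin/stdout contract.

```python
import sys


def transform_line(line):
    """Turn one tab-separated input row into a two-column output row."""
    fields = line.rstrip("\n").split("\t")
    # Hypothetical logic: thing1 is the first column upper-cased,
    # thing2 is the number of columns in the input row.
    return "\t".join([fields[0].upper(), str(len(fields))])


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read rows from stdin and write one transformed row per line to stdout.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as script.py, it would be referenced in the USING clause in place of 'script.exe'.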



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-custom-transform-scripts-in-Spark-tp25142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
