Spark on Hadoop with Java 8
Hi, I am contemplating the use of Hadoop with Java 8 in a production system. I will be using Apache Spark for doing most of the computations on data stored in HBase. Although Hadoop seems to support JDK 8 with some tweaks, the official HBase site states the following for version 0.98: "Running with JDK 8 works but is not well tested. Building with JDK 8 would require removal of the deprecated remove() method of the PoolMap class and is under consideration. See HBASE-7608 for more information about JDK 8 support." I am inclined towards using JDK 8 specifically for its support of lambda expressions, which will take a lot of verbosity out of my Spark programs (the Scala learning curve is a deterrent for me, as a possible bottleneck for future talent acquisition). Is it a good idea to use the Spark/Hadoop/HBase combo with Java 8 at the moment? Thanks, Jatin
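For a concrete sense of the verbosity win — a minimal, untested sketch against Spark's Java API (the input file and word-count logic are placeholders, not from the thread above). Spark 1.x's Java function types are single-method interfaces, so Java 8 lambdas can stand in for the usual anonymous inner classes:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LambdaSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local", "LambdaSketch");
        JavaRDD<String> lines = jsc.textFile("data.txt"); // hypothetical input

        // One lambda per transformation instead of an anonymous class each:
        long words = lines.flatMap(line -> Arrays.asList(line.split(" ")))
                          .filter(word -> !word.isEmpty())
                          .count();

        System.out.println("non-empty words: " + words);
        jsc.stop();
    }
}

The same two transformations in Java 7 would need two anonymous classes of roughly five lines each.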
Re: Upgrading 1.0.0 to 1.0.2
Ah, thanks. On Tue, Aug 26, 2014 at 7:32 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Victor, the issue with having different versions in the driver and the cluster is that the master will shut down your application due to an inconsistent SerialVersionUID in ExecutorState. Best, -- Nan Zhu On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote: Things will definitely compile, and apps compiled on 1.0.0 should even be able to link against 1.0.2 without recompiling. The only problem is if you run your driver with 1.0.0 on its classpath while the cluster has 1.0.2 in the executors. As for Mesos and YARN vs standalone, the difference is that they just have more features, at the expense of more complicated setup. For example, they have richer support for cross-application sharing (see https://spark.apache.org/docs/latest/job-scheduling.html) and the ability to run non-Spark applications on the same cluster. Matei On August 26, 2014 at 6:53:33 PM, Victor Tso-Guillen (v...@paxata.com) wrote: Yes, we are standalone right now. Do you have literature on why one would want to consider Mesos or YARN for Spark deployments? Sounds like I should try upgrading my project and seeing if everything compiles without modification. Then I can connect to an existing 1.0.0 cluster and see what happens... Thanks, Matei :) On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work going from 1.0.0 to 1.0.2. The problem, though, is that standalone mode grabs the executors' version of the Spark code from what's installed on the cluster, while your driver might be built against another version. On YARN and Mesos, you can more easily mix different versions of Spark, since each application ships its own Spark JAR (or references one from a URL), and this is used for both the driver and executors. Matei On August 26, 2014 at 6:10:57 PM, Victor Tso-Guillen (v...@paxata.com) wrote: I wanted to make sure that there's full compatibility between minor releases. I have a project that has a dependency on spark-core so that it can be a driver program and so that I can test locally. However, when connecting to a cluster you don't necessarily know what version you're connecting to. Is a 1.0.0 cluster binary compatible with a 1.0.2 driver program? Is a 1.0.0 driver program binary compatible with a 1.0.2 cluster?
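One small defensive habit that follows from this thread — a sketch, assuming your Spark build exposes the version string on SparkContext — is to log the driver's version at startup so a driver/cluster mismatch is at least visible in the logs:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class VersionCheck {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("VersionCheck"));
        // This is the version the driver was built against (e.g. "1.0.2");
        // it says nothing about what the executors are running.
        System.out.println("Driver Spark version: " + jsc.sc().version());
        jsc.stop();
    }
}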
RE: What is a Block Manager?
The framework has that info to manage cluster status, and that info (e.g. the worker count) is also available through the Spark metrics system. But from the user application's point of view, can you give an example of why you need this info and what you would plan to do with it? Best Regards, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Wednesday, August 27, 2014 1:40 PM To: Liu, Raymond Cc: user@spark.apache.org Subject: Re: What is a Block Manager? We're a single-app deployment, so we want to launch as many executors as the system has workers. We accomplish this by not configuring the max for the application. However, is there really no way to inspect what machines/executor ids/number of workers/etc. are available in the context? I'd imagine that there'd be something in the SparkContext or in the listener, but all I see in the listener is block managers getting added and removed. Wouldn't one care about workers getting added and removed at least as much as block managers? On Tue, Aug 26, 2014 at 6:58 PM, Liu, Raymond raymond@intel.com wrote: Basically, a Block Manager manages the storage for most of the data in Spark — to name a few: blocks that represent cached RDD partitions, intermediate shuffle data, broadcast data, etc. It is per executor, and in standalone mode you normally have one executor per worker. You don't control how many workers you have at runtime, but you can somewhat manage how many executors your application will launch. Check each running mode's documentation for details (but control where they run? Hardly — YARN mode does some work based on data locality, but that is done by the framework, not the user program). Best Regards, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Tuesday, August 26, 2014 11:42 PM To: user@spark.apache.org Subject: What is a Block Manager? I'm curious not only about what they do, but what their relationship is to the rest of the system. I find that I get listener events for n block managers added, where n is also the number of workers I have available to the application. Is this a stable constant? Also, are there ways to determine at runtime how many workers I have and where they are? Thanks, Victor
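To make Raymond's point concrete — a hedged sketch, assuming Spark 1.0.x, where SparkContext.getExecutorStorageStatus returns one entry per registered block manager (roughly one per executor, plus the driver); the master URL is a placeholder:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageStatus;

public class BlockManagerInspect {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("spark://master:7077", "BlockManagerInspect");
        // One StorageStatus per registered block manager (driver included).
        StorageStatus[] statuses = jsc.sc().getExecutorStorageStatus();
        for (StorageStatus s : statuses) {
            System.out.println("executor " + s.blockManagerId().executorId()
                    + " on " + s.blockManagerId().host()
                    + ", maxMem=" + s.maxMem());
        }
        jsc.stop();
    }
}

This is an indirect view — you see executors through their block managers, which matches why the listener reports block managers rather than workers.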
Re: CUDA in spark, especially in MLlib?
Maybe this would interest you — a CPU- and GPU-accelerated machine learning library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library; then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi, I am trying to find a CUDA library in Scala, to see if some of the matrix manipulation in MLlib can be sped up. I googled a bit but found no active projects on Scala+CUDA. Python is supported by CUDA, though. Any suggestion on whether this idea makes any sense? Best regards, Wei
Is there a way to insert data into an existing parquet file using Spark?
Hi, *Is there a way to insert data into an existing parquet file using Spark?* I am using Spark Streaming and Spark SQL to store real-time data in parquet files and then query them using Impala. Spark creates multiple subdirectories of parquet files, which makes loading them into Impala a challenge. I want to insert data into an existing parquet file instead of creating a new parquet file each time. I have tried the INSERT statement, but it makes performance too slow. Please suggest whether there is any way to insert or append data into an existing parquet file. Regards, Rafeeq S *("What you do is what matters, not what you think or say or plan.")*
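One hedged workaround, since Parquet files are immutable once written and cannot be appended to in place — a sketch against the Spark 1.0.x Java SQL API, with all paths as placeholders: write each micro-batch to its own subdirectory under one table root and have Impala refresh its metadata to pick up the new files, rather than using INSERT:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class ParquetBatchWriter {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local", "ParquetBatchWriter");
        JavaSQLContext sqlCtx = new JavaSQLContext(jsc);

        // Placeholder: in a streaming job this would be the current batch.
        JavaSchemaRDD batch = sqlCtx.parquetFile("hdfs:///events/seed");

        // One directory per batch under the table root, instead of trying
        // to append to an existing file.
        batch.saveAsParquetFile("hdfs:///events/batch-" + System.currentTimeMillis());
        jsc.stop();
    }
}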
Re: Spark Streaming Output to DB
Like Mayur said, it's better to use mapPartitions instead of map. Here's a piece of code which reads a text file and inserts each line into the database. I haven't tested it; it might throw up some serialization errors, and in that case, you gotta serialize them!

JavaRDD<String> txtRDD = jsc.textFile("/sigmoid/logs.cap");
JavaRDD<String> dummyRDD = txtRDD.map(new Function<String, String>() {
    @Override
    public String call(String raw) throws Exception {
        // Creating JDBC Connection
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Driver myDriver = new oracle.jdbc.driver.OracleDriver();
        DriverManager.registerDriver(myDriver);
        String URL = "jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP";
        Connection conn = DriverManager.getConnection(URL);
        Statement stmt = conn.createStatement();
        // Do some operations!!
        stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')");
        // Now close the connection
        conn.close();
        return raw + " [ Added ]";
    }
});

Thanks Best Regards On Wed, Aug 27, 2014 at 10:06 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I would suggest you use the JDBC connector in mapPartitions instead of map, as JDBC connections are costly and can really impact your performance. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Aug 26, 2014 at 6:45 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes, you can open a JDBC connection at the beginning of the map method, close this connection at the end of map(), and use the connection in between. Thanks Best Regards On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com wrote: Hello People, I'm using Java Spark Streaming. I'm just wondering, can I make a simple JDBC connection in the JavaDStream map() method? Or do I need to create a JDBC connection for each JavaPairDStream, after the map task? Kindly give your thoughts. Cheers, Ravi Sharma
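For completeness, an untested sketch of the mapPartitions variant suggested above — one connection per partition instead of one per record (same placeholder driver, URL and table as in the code above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

public class PartitionInsert {
    public static JavaRDD<String> insertAll(JavaRDD<String> txtRDD) {
        return txtRDD.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterable<String> call(Iterator<String> rows) throws Exception {
                // One connection and statement for the whole partition.
                Class.forName("oracle.jdbc.driver.OracleDriver");
                Connection conn = DriverManager.getConnection(
                        "jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP");
                Statement stmt = conn.createStatement();
                List<String> out = new ArrayList<String>();
                while (rows.hasNext()) {
                    String raw = rows.next();
                    stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')");
                    out.add(raw + " [ Added ]");
                }
                conn.close();
                return out;
            }
        });
    }
}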
Developing a spark streaming application
Hey guys, so the problem I'm trying to tackle is the following: - I need a data source that emits messages at a certain frequency - There are N neural nets that need to process each message individually - The outputs from all neural nets are aggregated, and only when all N outputs for a message are collected should that message be declared fully processed - At the end I should measure the time it took for a message to be fully processed (the time between when it was emitted and when all N neural net outputs for that message have been collected) What I'm mostly interested in is whether I approached the problem correctly in the first place and, if so, some best-practice pointers on my approach. My current implementation is the following: For a data source I created the class

public class JavaRandomReceiver extends Receiver<Map<String, Object>>

as I decided a key-value store would be best suited to holding emitted data. The onStart() method initializes a custom random sequence generator and starts a thread that continuously generates new neural net inputs and stores them as follows:

SensorData sdata = generator.createSensorData();
Map<String, Object> result = new HashMap<String, Object>();
result.put("msgNo", sdata.getMsgNo());
result.put("sensorTime", sdata.getSampleTime());
result.put("list", sdata.getPayload());
result.put("timeOfProc", sdata.getCreationTime());
store(result);
// sleeps for a given amount of time set at generator creation
generator.waitForNextTuple();

The msgNo here is incremented for each newly created message and is used to keep track of each message. The neural net functionality is added by creating a custom mapper

public class NeuralNetMapper implements Function<Map<String, Object>, Map<String, Object>>

whose call function basically just takes the input map, plugs its list object in as the input to the neural net object, replaces the map's initial list with the neural net output, and returns the modified map. The aggregator is implemented as a single class of the following form:

public class JavaSyncBarrier implements Function<JavaRDD<Map<String, Object>>, Void>

This class maintains a Google Guava cache of the neural net outputs it has received, in the form of <Long, List<Map<String, Object>>>, where the Long value is the msgNo and the list contains all maps carrying that message number. When a new map is received, it is added to the cache and its list's length is compared to the total number of neural nets; if these numbers match, that message number is declared fully processed, and the difference between timeOfProc (all maps with the same msgNo have the same timeOfProc) and the current system time is displayed as the total time necessary for processing.
Now the way all these components are linked together is the following:

public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    conf.setAppName("SimpleSparkStreamingTest");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
    jssc.checkpoint("/tmp/spark-tempdir");

    // Generator config goes here
    // Set to emit a new message every 1 second
    // ---
    // Neural net config goes here
    // ---

    JavaReceiverInputDStream<Map<String, Object>> rndLists = jssc
        .receiverStream(new JavaRandomReceiver(generatorConfig));

    List<JavaDStream<Map<String, Object>>> neuralNetOutputStreams =
        new ArrayList<JavaDStream<Map<String, Object>>>();
    for (int i = 0; i < numberOfNets; i++) {
        neuralNetOutputStreams.add(rndLists.map(new NeuralNetMapper(neuralNetConfig)));
    }

    JavaDStream<Map<String, Object>> joined = joinStreams(neuralNetOutputStreams);
    joined.foreach(new JavaSyncBarrier(numberOfNets));

    jssc.start();
    jssc.awaitTermination();
}

where joinStreams unifies a list of streams:

public static <T> JavaDStream<T> joinStreams(List<JavaDStream<T>> streams) {
    JavaDStream<T> result = streams.get(0);
    for (int i = 1; i < streams.size(); i++) {
        result = result.union(streams.get(i));
    }
    return result;
}
hive on spark yarn
Hi all, when I run a simple SQL statement I encounter the following error. hive 0.12 (metastore in mysql), hadoop 2.4.1, spark 1.0.2 built with hive. My HQL code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.LocalHiveContext

object HqlTest {
  case class Record(key: Int, value: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HiveFromSpark")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
  }
}

14/08/27 16:07:08 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
14/08/27 16:07:08 INFO ObjectStore: ObjectStore, initialize called
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:163)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:250)
    at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
    at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)
    at HqlTest$.main(HqlTest.scala:15)
    at HqlTest.main(HqlTest.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    ... 26 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210)
    ... 31 more
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304)
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:234)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:209)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at
Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)
Thank you for your answers, and sorry for my lack of understanding. I tried what you suggested, with and without unpersisting and with .cache() (also persist(StorageLevel.MEMORY_AND_DISK), but this is not allowed for msg because apparently you can't change the storage level) for msg, g and newVerts, but it gave exactly the same result mentioned in my previous post. For example, before failing, in one iteration I get the following log for the innerJoin task:

14/08/27 10:30:58 INFO executor.Executor: Running task ID 37
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_3 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_6 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_7 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_4 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_10 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_11 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_8 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_14 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_15 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_12 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_18 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_19 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_16 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block rdd_122_0 locally
14/08/27 10:30:58 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/08/27 10:30:58 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/08/27 10:30:58 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms

If I understand this correctly, this means it is looking up the whole list of broadcast variables, even though it should only need the 3 current values... I'm a little confused about this. For the sake of completeness, here is the same portion of the code I am using, with cache() and unpersist removed (I tried multiple combinations; I also tried without broadcast variables, e.g. directly feeding the innerJoin with the value instead of the broadcast):

def run(graph: Graph[Long, Long], m: Long)(implicit sc: SparkContext) = {
  val spd = SparsePD
  // Init fusionMap
  var fusionMap = Map[Long, Long]().withDefault(x => x)
  // Init tots
  val tots = Map[Long, Double]().withDefaultValue(1.0)
  var totBcst = sc.broadcast(tots)
  var fusionBcst = sc.broadcast(fusionMap)
  val mC = sc.broadcast(m)

  // Initial distributions
  var g = graph.mapVertices({ case (vid, deg) => VertexProp(somedistrib.withDefaultValue(0.0), deg) })
  var newVerts = g.vertices

  // Initial messages
  var msg = g.mapReduceTriplets(MFExecutor.sendMsgMF, MFExecutor.mergeMsgMF)

  var iter = 0
  while (iter < 20) {
    // Messages
    val oldMessages = msg
    val oldVerts = newVerts
    newVerts = newVerts.innerJoin(msg)(MFExecutor.vprogMF(mC, totBcst, fusionBcst)).persist(StorageLevel.MEMORY_AND_DISK)
    newVerts.checkpoint() // Kills the lineage of newVerts?
    newVerts.count() // Materializes the two previous lines

    val prevG = g
    g = graph.outerJoinVertices(newVerts)({ case (vid, deg, newOpt) => newOpt.getOrElse(VertexProp(Map(vid -> 1.0).withDefaultValue(0.0), deg)) }).cache()
    //g = g.outerJoinVertices(newVerts)({ case (vid, old, newOpt) => newOpt.getOrElse(old) })

    // 1st global var
    val fusionAcc = sc.accumulable[Map[Long, Long], (Long, Long)](fusionMap)(FusionAccumulable)
    g.triplets.filter(tp => testEq(fusionBcst)(tp.srcId, tp.dstId) && (spd.dotPD(tp.dstAttr.prob, tp.srcAttr.prob) > 0.9))
      .foreach(tp => fusionAcc += (tp.dstId, tp.srcId))
    //fusionBcst.unpersist(blocking = false)
    fusionMap = fusionAcc.value
    fusionBcst = sc.broadcast(fusionMap)

    // 2nd global var
    val totAcc = sc.accumulator[Map[Long, Double]](Map[Long, Double]().withDefaultValue(0.0))(TotAccumulable)
    newVerts.foreach({ case (vid, vprop) => totAcc += vprop.prob.mapValues(p => p * vprop.deg).withDefaultValue(0.0) })
    //totBcst.unpersist(blocking = false)
    totBcst = sc.broadcast(totAcc.value)

    // New messages
    msg = g.mapReduceTriplets(MFExecutor.sendMsgMF, MFExecutor.mergeMsgMF).cache()

    // Unpersist options
    //oldMessages.unpersist(blocking = false)
    //oldVerts.unpersist(blocking = false)
    //prevG.unpersistVertices(blocking = false)

    iter = iter + 1
  }
Example File not running
Hello all, I am able to use Spark in the shell but I am not able to run a Spark file. I am using sbt and the jar is created, but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of Spark, and sbt package compiles the Scala file to a jar. I am running it locally on my machine. The error log is huge, but it starts with something like this:

14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844
14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(D062844)
14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started
14/08/27 12:14:22 INFO Remoting: Starting remoting
14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.94.74.159:51157]
14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.94.74.159:51157]
14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker
14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster
14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at C:\Users\D062844\AppData\Local\Temp\spark-local-20140827121422-dec8
14/08/27 12:14:22 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/08/27 12:14:22 INFO ConnectionManager: Bound socket to port 51160 with id = ConnectionManagerId(10.94.74.159,51160)
14/08/27 12:14:22 INFO BlockManagerMaster: Trying to register BlockManager
14/08/27 12:14:22 INFO BlockManagerInfo: Registering block manager 10.94.74.159:51160 with 294.9 MB RAM
14/08/27 12:14:22 INFO BlockManagerMaster: Registered BlockManager
14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server
14/08/27 12:14:22 INFO HttpBroadcast: Broadcast server started at http://10.94.74.159:51161
14/08/27 12:14:22 INFO HttpFileServer: HTTP File server directory is C:\Users\D062844\AppData\Local\Temp\spark-d79d2857-3d85-4b16-8d76-ade83d465f10
14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server
14/08/27 12:14:22 INFO SparkUI: Started SparkUI at http://10.94.74.159:4040
14/08/27 12:14:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/27 12:14:23 INFO SparkContext: Added JAR file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/target/scala-2.10/simple-project_2.10-1.0.jar at http://10.94.74.159:51162/jars/simple-project_2.10-1.0.jar with timestamp 1409134463198
14/08/27 12:14:23 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=309225062
14/08/27 12:14:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 294.8 MB)
14/08/27 12:14:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    ...

It goes on like this and doesn't show me the result of the count in the file. I had installed the pre-built Hadoop2 version of Spark 1.0.2 from the site.
Replicate RDDs
Hi, I have a three-node Spark cluster. I restricted the resources per application by setting the appropriate parameters, and I could run two applications simultaneously. Now, I want to replicate an RDD and run two applications simultaneously. Can someone help with how to go about doing this? I replicated an RDD of size 1354 MB over this cluster. The web UI shows that it's replicated twice. But when I go to the storage details, the two partitions, each of size ~677 MB, are stored on the same node. All other nodes do not contain any partitions. Can someone tell me where I am going wrong? Thank you!! -karthik
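For reference, a minimal sketch of asking for two in-memory copies of an RDD (assuming the Java API; the master URL and input path are placeholders). Note that the _2 storage levels only request replication — where the copies land is decided by the block manager, not the application:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.StorageLevels;

public class ReplicateRdd {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("spark://master:7077", "ReplicateRdd");
        JavaRDD<String> data = jsc.textFile("hdfs:///data/input");
        // MEMORY_ONLY_2 keeps two copies of each cached block.
        data.persist(StorageLevels.MEMORY_ONLY_2);
        System.out.println("count = " + data.count()); // force materialization
        jsc.stop();
    }
}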
Re: Example File not running
The statement "java.io.IOException: Could not locate executable null\bin\winutils.exe" explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing *HADOOP_HOME* in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, I am able to use Spark in the shell but I am not able to run a Spark file. I am using sbt and the jar is created, but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. [...]
RE: Installation On Windows machine
Thank you for the reply, Matei. Is there something we missed? I am able to run the Spark instance on my local system, i.e. Windows 7, but the same set of steps does not allow me to run it on a Windows Server 2012 machine. The black screen just appears for a fraction of a second and disappears, and I am unable to debug it. Please guide me. Thanks, Abhishek -----Original Message----- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Saturday, August 23, 2014 9:47 AM To: Mishra, Abhishek Cc: user@spark.apache.org Subject: Re: Installation On Windows machine You should be able to just download / unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using bin\spark-class org.apache.spark.deploy.master.Master and bin\spark-class org.apache.spark.deploy.worker.Worker spark://master:port For submitting jobs to YARN instead of the standalone cluster, spark-submit.cmd *may* work but I don't think we've tested it heavily. If you find issues with that, please let us know. But overall the instructions should be the same as on Linux, except you use the .cmd scripts instead of the .sh ones. Matei On Aug 22, 2014, at 3:01 AM, Mishra, Abhishek abhishek.mis...@xerox.com wrote: Hello Team, I was just trying to install Spark on my Windows Server 2012 machine and use it in my project, but unfortunately I do not find any documentation for the same. Please let me know if we have drafted anything for Spark users on Windows. I am really in need of it, as we are using Windows machines for Hadoop and other tools and so cannot move back to a Linux OS. We run Hadoop on the Hortonworks HDP 2.0 platform, and recently I came across Spark and wanted to use it in my project for my analytics work. Please suggest links or documents where I can move ahead with my installation and usage. I want to run it on Java. Looking forward to a reply, Thanking you in advance, Sincerely, Abhishek
RE: Example File not running
What value should I put in that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, 27 August 2014 12:54 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running The statement "java.io.IOException: Could not locate executable null\bin\winutils.exe" explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing HADOOP_HOME in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, I am able to use Spark in the shell but I am not able to run a Spark file. [...]
External dependencies management with spark
Dear all, I'm looking for an efficient way to manage external dependencies. I know that one can add .jar or .py dependencies easily, but how can I handle other types of dependencies? Specifically, I have some data processing algorithms implemented in other languages (Ruby, Octave, Matlab, C++), and I need to put all of these library components (sets of .rb, .m, .so files) on my workers in order to be able to call them as external processes. What is the best way to achieve this? Cheers, Jaonary
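One hedged approach — a sketch using SparkContext.addFile, which ships arbitrary files to every node that runs tasks, with SparkFiles.get resolving where a shipped file landed on the local worker (all file names and the master URL below are placeholders):

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class ShipDependencies {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("spark://master:7077", "ShipDependencies");
        // Distribute the non-JVM components alongside the job.
        jsc.addFile("/local/path/algo.m");
        jsc.addFile("/local/path/libalgo.so");

        // Inside a task (e.g. in a map function), resolve the worker-local
        // path before launching the external process:
        // String algoPath = SparkFiles.get("algo.m");

        jsc.stop();
    }
}

For larger bundles, another option is to pre-install the libraries on each worker and pass their location to the external process via an environment variable.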
Re: Example File not running
It should point to your Hadoop installation directory (like C:\hadoop\). Since you don't have Hadoop installed, what is the code that you are running? Thanks Best Regards On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: What value should I put in that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Wednesday, 27 August 2014 12:54 *To:* Hingorani, Vineet *Cc:* user@spark.apache.org *Subject:* Re: Example File not running The statement "java.io.IOException: Could not locate executable null\bin\winutils.exe" explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing *HADOOP_HOME* in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, I am able to use Spark in the shell but I am not able to run a Spark file. [...]
Re: spark and matlab
Thank you, Matei. I found a solution using pipe and the matlab engine (an executable that can call matlab behind the scenes and uses stdin and stdout to communicate). I just need to fix two other issues: - How can I handle my dependencies? My matlab script needs other matlab files that must be present on each worker's matlab path. So I need a way to push them to each worker and tell matlab where to find them with addpath. I know how to call addpath, but I don't know what the path should be. - Does the pipe() operator work at the partition level, so the external process runs once per partition rather than once per element? Initializing my external process costs a lot, so it is not good to call it several times. On Mon, Aug 25, 2014 at 9:03 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Have you tried the pipe() operator? It should work if you can launch your script from the command line. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa (jaon...@gmail.com) wrote: Hi all, has anyone tried to pipe an RDD into a matlab script? I'm trying to do something similar, if one of you could point out some hints. Best regards, Jao
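On the second question: pipe() does operate per partition — it launches the command once for each partition and streams that partition's elements through the process's stdin/stdout. A tiny runnable sketch, using `cat` as a stand-in for the matlab wrapper:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PipePerPartition {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "PipePerPartition");
        JavaRDD<String> input = jsc.parallelize(Arrays.asList("a", "b", "c", "d"), 2);

        // One external process per partition: startup cost is paid twice
        // here (two partitions), not four times (four elements).
        JavaRDD<String> piped = input.pipe("cat");
        System.out.println(piped.collect());
        jsc.stop();
    }
}

So an expensive-to-start engine is amortized by using fewer, larger partitions.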
RE: Example File not running
The code is the example given on the Spark site:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // Should be some file on your system
    val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

It goes on like this and doesn't show me the result of the count in the file. I had installed the pre-built Hadoop2 version of Spark 1.0.2 from the site. The problem, I think, is that I am running it on a local machine and it is not able to find some dependencies of Hadoop. Please tell me what file I should download to work on my local machine (pre-built, so that I don't have to build it again). From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, 27 August 2014 13:35 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running It should point to your Hadoop installation directory (like C:\hadoop\). Since you don't have Hadoop installed, what is the code that you are running? Thanks Best Regards [...]
NotSerializableException while doing rdd.saveToCassandra
Hi All, I am using spark-1.0.0 to parse a json file and save the values to cassandra using a case class. My code looks as follows:

case class LogLine(x1: Option[String], x2: Option[String], x3: Option[List[String]], x4: Option[String], x5: Option[String], x6: Option[String], x7: Option[String], x8: Option[String], x9: Option[String])

val data = test.map(line => {
  parse(line)
}).map(json => {
  // Extract the values
  implicit lazy val formats = org.json4s.DefaultFormats
  val x1 = (json \ "x1").extractOpt[String]
  val x2 = (json \ "x2").extractOpt[String]
  val x4 = (json \ "x4").extractOpt[String]
  val x5 = (json \ "x5").extractOpt[String]
  val x6 = (json \ "x6").extractOpt[String]
  val x7 = (json \ "x7").extractOpt[String]
  val x8 = (json \ "x8").extractOpt[String]
  val x3 = (json \ "x3").extractOpt[List[String]]
  val x9 = (json \ "x9").extractOpt[String]
  LogLine(x1, x2, x3, x4, x5, x6, x7, x8, x9)
})

data.saveToCassandra("test", "test_data", Seq("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"))

whereas the cassandra table schema is as follows:

CREATE TABLE test_data (
  x1 varchar,
  x2 varchar,
  x4 varchar,
  x5 varchar,
  x6 varchar,
  x7 varchar,
  x8 varchar,
  x3 list<text>,
  x9 varchar,
  PRIMARY KEY (x1)
);

I am getting the following error on executing the saveToCassandra statement:

14/08/27 11:33:59 INFO SparkContext: Starting job: runJob at package.scala:169
14/08/27 11:33:59 INFO DAGScheduler: Got job 5 (runJob at package.scala:169) with 1 output partitions (allowLocal=false)
14/08/27 11:33:59 INFO DAGScheduler: Final stage: Stage 5(runJob at package.scala:169)
14/08/27 11:33:59 INFO DAGScheduler: Parents of final stage: List()
14/08/27 11:33:59 INFO DAGScheduler: Missing parents: List()
14/08/27 11:33:59 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[7] at map at console:45), which has no missing parents
14/08/27 11:33:59 INFO DAGScheduler: Failed to run runJob at package.scala:169
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkConf

data.saveToCassandra("test", "test_data", Seq("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"))

Here the data field is org.apache.spark.rdd.RDD[LogLine] = MappedRDD[7] at map at console:45. How can I make this serializable, or is this a different problem? Please advise. Regards, lmk
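As a general illustration of the usual cause (hedged — this is a common pattern, not a diagnosis of the code above): the error often means a closure captured its enclosing object, dragging non-serializable fields such as a SparkConf along with it. Keeping the function in a self-contained, static class avoids capturing anything extra:

import org.apache.spark.api.java.function.Function;

public class ParseFunctions {
    // A static nested class serializes only its own state; it cannot
    // accidentally capture a driver-side SparkConf or SparkContext the
    // way a function declared inside a driver object can.
    public static class ToUpper implements Function<String, String> {
        @Override
        public String call(String line) {
            return line.toUpperCase();
        }
    }
    // Usage: rdd.map(new ParseFunctions.ToUpper())
}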
Example file not running
Hello all, I am able to use Spark in the shell but I am not able to run a spark file. I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and sbt package is compiling the scala file to jar. I am running it locally on my machine. It goes on like this and doesn't show me the result of count in the file. I had installed pre-built version of Spark1.0.2 named Hadoop2 version Spark on the site. The problem what I think is because I am running it on local machine and it is not able to find some dependencies of Hadoop. Please tell me what file should I download to work on my local machine (pre-built, so that I don't have to build it again). The error log is huge but it starts with something like this: 14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j profile: org/apache/sp ark/log4j-defaults.properties 14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844 14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(D062844) 14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started 14/08/27 12:14:22 INFO Remoting: Starting remoting 14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.94.74.159:51157http://rk@10.94.74.159:51157] 14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.9 4.74.159:51157] 14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker 14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster 14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at C:\Users\D062844\AppDa ta\Local\Temp\spark-local-20140827121422-dec8 14/08/27 12:14:22 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/08/27 12:14:22 INFO ConnectionManager: Bound socket to port 51160 with id = ConnectionM anagerId(10.94.74.159,51160) 14/08/27 12:14:22 INFO BlockManagerMaster: Trying to register BlockManager 14/08/27 12:14:22 INFO BlockManagerInfo: Registering block manager 10.94.74.159:51160http://10.94.74.159:51160 with 294.9 MB RAM 14/08/27 12:14:22 INFO BlockManagerMaster: Registered BlockManager 14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server 14/08/27 12:14:22 INFO HttpBroadcast: Broadcast server started at http://10.94.74.159:5116 1 14/08/27 12:14:22 INFO HttpFileServer: HTTP File server directory is C:\Users\D062844\AppD ata\Local\Temp\spark-d79d2857-3d85-4b16-8d76-ade83d465f10 14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server 14/08/27 12:14:22 INFO SparkUI: Started SparkUI at http://10.94.74.159:4040 14/08/27 12:14:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your pla tform... 
using builtin-java classes where applicable 14/08/27 12:14:23 INFO SparkContext: Added JAR file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/target/scala-2.10/simple-project_2.10-1.0.jar at http://10.94.74.159:51162/jars/simple-project_2.10-1.0.jar with timestamp 1409134463198 14/08/27 12:14:23 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=309225062 14/08/27 12:14:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 294.8 MB) 14/08/27 12:14:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546) at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:145) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at
How to get prerelease thriftserver working?
(apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand (and run) the Thrift-based JDBC server, in order to enable remote JDBC access to our dev cluster. Before asking about details, is my understanding of this correct? `sbin/start-thriftserver` is a JDBC/Hive server that doesn't require running a Hive+MR cluster (i.e. just Spark/Spark+YARN)? Assuming yes, I have hope that it all basically works, just that some documentation needs to be cleaned up: - I found a release page implying that 1.1 will be released pretty soon-ish: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - I can find recent (within the last 30 days or so) activity with promising titles: [Updated Spark SQL README to include the hive-thriftserver module](https://github.com/apache/spark/pull/1867), [[SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)]( https://github.com/apache/spark/pull/1620) Am I following all the right email threads, issue trackers, and whatnot? Specifically, I tried: 1. Building off of `branch-1.1`, synced as of ~today (2014 Aug 25) 2. Running `sbin/start-thriftserver.sh` in `yarn-client` mode 3. I can see the process running, and the spark context/app created in the yarn logs, and can connect to the thrift server on the default port of 10000 using `bin/beeline` 4. However, when I try to find out what that cluster has via `show tables;`, in the logs I see a connection error to some (what I assume to be) random port. So what service am I forgetting/too ignorant to run? Or did I misunderstand and we do need a live Hive instance to back the thriftserver? Or is this a YARN-specific issue? Only recently started learning the ecosystem and community, so apologies for the longer post and lots of questions. :) Matt
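For reference, the connect step in (3) looks roughly like this (a sketch; localhost and the 10000 default port are the stock HiveServer2 conventions, not something confirmed in this thread):

    ./bin/beeline
    beeline> !connect jdbc:hive2://localhost:10000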
RE: Installation On Windows machine
I got it right, Matei, thank you. I was giving a wrong directory path. Thank you...!! Thanks, Abhishek Mishra -Original Message- From: Mishra, Abhishek [mailto:abhishek.mis...@xerox.com] Sent: Wednesday, August 27, 2014 4:38 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: RE: Installation On Windows machine Thank you for the reply, Matei. Is there something which we missed? I am able to run the spark instance on my local system, i.e. Windows 7, but the same set of steps does not allow me to run it on a Windows Server 2012 machine. The black screen just appears for a fraction of a second and disappears, and I am unable to debug it. Please guide me. Thanks, Abhishek -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Saturday, August 23, 2014 9:47 AM To: Mishra, Abhishek Cc: user@spark.apache.org Subject: Re: Installation On Windows machine You should be able to just download / unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using bin\spark-class org.apache.spark.deploy.master.Master and bin\spark-class org.apache.spark.deploy.worker.Worker spark://master:port For submitting jobs to YARN instead of the standalone cluster, spark-submit.cmd *may* work but I don't think we've tested it heavily. If you find issues with that, please let us know. But overall the instructions should be the same as on Linux, except you use the .cmd scripts instead of the .sh ones. Matei On Aug 22, 2014, at 3:01 AM, Mishra, Abhishek abhishek.mis...@xerox.com wrote: Hello Team, I was just trying to install spark on my windows server 2012 machine and use it in my project, but unfortunately I do not find any documentation for the same. Please let me know if we have drafted anything for spark users on Windows. I am really in need of it, as we are using Windows machines for Hadoop and other tools and so cannot move back to Linux OS or anything. We run Hadoop on the hortonworks HDP2.0 platform, and recently I came across Spark and wanted to use it in my project for my Analytics work. Please suggest links or documents where I can move ahead with my installation and usage. I want to run it on Java. Looking forward to a reply, Thanking you in Advance, Sincerely, Abhishek - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
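Matei's manual-launch advice, written out as commands (a sketch; the master host and the 7077 port are placeholders, and on Windows the .cmd wrapper is the one to call):

    REM start the master on one machine, then a worker on each machine
    bin\spark-class.cmd org.apache.spark.deploy.master.Master
    bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://MASTER-HOST:7077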
Re: Example File not running
You can install hadoop 2 by reading this doc https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it, you can set the environment variable HADOOP_HOME and then it should work. Also, not sure if it will work, but can you provide file:// at the front and give it a go? I don't see any requirement for hadoop here. /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "file://C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } Thanks Best Regards On Wed, Aug 27, 2014 at 5:09 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The code is the example given on the Spark site: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } It goes on like this and doesn’t show me the result of the count in the file. I had installed the pre-built version of Spark 1.0.2 named Hadoop 2 on the site. What I think the problem is: I am running it on a local machine and it is not able to find some dependencies of Hadoop. Please tell me what file I should download to work on my local machine (pre-built, so that I don’t have to build it again). From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 13:35 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running It should point to your hadoop installation directory (like C:\hadoop\). Since you don't have hadoop installed, what is the code that you are running? Thanks Best Regards On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: What should I put as the value of that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 12:54 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running The statement java.io.IOException: Could not locate executable null\bin\winutils.exe explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing HADOOP_HOME in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, I am able to use Spark in the shell but I am not able to run a spark file.
I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and sbt package is compiling the scala file to a jar. I am running it locally on my machine. The error log is huge but it starts with something like this: 14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844 14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(D062844) 14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started 14/08/27 12:14:22 INFO Remoting: Starting remoting 14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.94.74.159:51157] 14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.94.74.159:51157] 14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker 14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster 14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at C:\Users\D062844\AppData\Local\Temp\spark-local-20140827121422-dec8 14/08/27 12:14:22 INFO MemoryStore: MemoryStore started
RE: Example File not running
It didn’t work after adding file:// in the front. I compiled it again and ran it. The same errors are coming. Do you think there can be some problem with the java dependency? Also, I don’t want to install Hadoop; I just want to run it on my local machine. The reason is, whenever I install these things they don’t run due to some dependencies, and then I have to spend time figuring out what dependencies are needed. Thank you for helping, but it is depressing that I am not even able to run a simple example with Spark. Regards, Vineet From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 16:01 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running You can install hadoop 2 by reading this doc https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it, you can set the environment variable HADOOP_HOME and then it should work. Also, not sure if it will work, but can you provide file:// at the front and give it a go? I don't see any requirement for hadoop here. /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "file://C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } Thanks Best Regards On Wed, Aug 27, 2014 at 5:09 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The code is the example given on the Spark site: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } It goes on like this and doesn’t show me the result of the count in the file. I had installed the pre-built version of Spark 1.0.2 named Hadoop 2 on the site. What I think the problem is: I am running it on a local machine and it is not able to find some dependencies of Hadoop. Please tell me what file I should download to work on my local machine (pre-built, so that I don’t have to build it again). From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 13:35 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running It should point to your hadoop installation directory (like C:\hadoop\). Since you don't have hadoop installed, what is the code that you are running?
Thanks Best Regards On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: What should I put as the value of that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 12:54 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Example File not running The statement java.io.IOException: Could not locate executable null\bin\winutils.exe explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing HADOOP_HOME in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, I am able to use Spark in the shell but I am not able to run a spark file. I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and sbt package is compiling the scala file to
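A common workaround for this winutils error on Windows, sketched here as an assumption rather than something confirmed in this thread: download a winutils.exe build, place it under a stub layout like C:\hadoop\bin\winutils.exe, and point hadoop.home.dir at it before creating the SparkContext (Hadoop's Shell class reads this system property before falling back to HADOOP_HOME).

    // hedged sketch: C:\hadoop\bin\winutils.exe is an assumed, illustrative location
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)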
Re: Does HiveContext support Parquet?
What Spark and Hadoop versions are you on? I have it working in my Spark app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar. I'm running Spark 1.0.2 and CDH5. Can you try bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar to see if that works? On 8/26/14, 6:03 PM, lyc yanchen@huawei.com wrote: Hi Silvio, I re-downloaded hive-0.12-bin and reset the related path in spark-env.sh. However, I still got some error. Do you happen to know any step I did wrong? Thank you! My detailed steps are as follows: #enter spark-shell (successful) /bin/spark-shell --master spark://S4:7077 --jars /home/hduser/parquet-hive-bundle-1.5.0.jar #import related hiveContext (successful) ... # create parquet table: hql("CREATE TABLE parquet_test (id int, str string, mp MAP<STRING,STRING>, lst ARRAY<STRING>, strct STRUCT<A:STRING,B:STRING>) PARTITIONED BY (part string) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'") get error: 14/08/26 21:59:20 ERROR exec.DDLTask: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetPrimitiveInspectorFactory at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:77) at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:59) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:218) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:272) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:265) at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:597) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:186) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:247) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90) at $line34.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:18) at $line34.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:23) at $line34.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25) at $line34.$read$$iwC$$iwC$$iwC.<init>(<console>:27) at $line34.$read$$iwC$$iwC.<init>(<console>:29) at $line34.$read$$iwC.<init>(<console>:31) at $line34.$read.<init>(<console>:33) at $line34.$read$.<init>(<console>:37) at $line34.$read$.<clinit>(<console>) at $line34.$eval$.<init>(<console>:7) at $line34.$eval$.<clinit>(<console>) at $line34.$eval.$print(<console>) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at
Saddle structure in Spark
Hello everyone, Is it possible to use an external data structure, such as Saddle, in Spark? As far as I know, an RDD is a kind of wrapper or container that has a certain data structure inside. So I was wondering whether this data structure has to be a basic (or native) structure, or whether it can be any available structure. Thanks in advance, fellows -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saddle-structure-in-Spark-tp12916.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
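As a hedged illustration of the general rule (not Saddle-specific): an RDD can hold any type whose instances Spark can serialize, library types included, so a Saddle Series or Frame should work the same way as long as it is Java- or Kryo-serializable.

    // minimal sketch with a made-up type; a serializable Saddle type would
    // slot in the same way
    case class Point(x: Double, y: Double)
    val pts = sc.parallelize(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))
    val total = pts.map(p => p.x + p.y).sum()  // normal RDD ops over the custom type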
Re: Spark Streaming Output to DB
Thank you Akhil and Mayur. It will be really helpful. Thanks, On 27 Aug 2014 13:19, Akhil Das ak...@sigmoidanalytics.com wrote: Like Mayur said, it's better to use mapPartitions instead of map. Here's a piece of code which typically reads a text file and inserts each row into the database. I haven't tested it, it might throw up some Serialization errors, in that case you gotta serialize them! JavaRDD<String> txtRDD = jsc.textFile("/sigmoid/logs.cap"); JavaRDD<String> dummyRDD = txtRDD.map(new Function<String, String>() { @Override public String call(String raw) throws Exception { //Creating JDBC Connection Class.forName("oracle.jdbc.driver.OracleDriver"); Driver myDriver = new oracle.jdbc.driver.OracleDriver(); DriverManager.registerDriver(myDriver); String URL = "jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP"; Connection conn = DriverManager.getConnection(URL); Statement stmt = conn.createStatement(); //Do some operations!! stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')"); //Now close the connection conn.close(); return raw + " [ Added ]"; } }); Thanks Best Regards On Wed, Aug 27, 2014 at 10:06 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I would suggest you use a JDBC connector in mapPartitions instead of map, as JDBC connections are costly and can really impact your performance. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Aug 26, 2014 at 6:45 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes, you can open a jdbc connection at the beginning of the map method, then close this connection at the end of map(), and in between you can use this connection. Thanks Best Regards On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com wrote: Hello People, I'm using java spark streaming. I'm just wondering, can I make a simple jdbc connection in the JavaDStream map() method? Or do I need to create a jdbc connection for each JavaPairDStream, after the map task? Kindly give your thoughts. Cheers, Ravi Sharma
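A Scala sketch of the mapPartitions variant Mayur suggests, opening one connection per partition instead of per record (the JDBC URL is the illustrative one from the snippet above, not a real endpoint):

    import java.sql.DriverManager
    val dummyRDD = txtRDD.mapPartitions { rows =>
      Class.forName("oracle.jdbc.driver.OracleDriver")
      val conn = DriverManager.getConnection("jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP")
      val stmt = conn.createStatement()
      // materialize the partition before closing the connection: iterators are lazy
      val results = rows.map { raw =>
        stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')")
        raw + " [ Added ]"
      }.toList
      conn.close()
      results.iterator
    }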
Reference Accounts Large Node Deployments
All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By 'large scale' I mean both through-put and number of nodes. I'm attempting an objective comparison of Streaming and Storm, and while this data is known for Storm, there appears to be little for Spark Streaming. If you know of any such deployments, please post them here, because I am sure I'm not the only one wondering about this. If customer confidentiality prevents mentioning them by name, consider identifying them by industry, e.g. 'telco doing X with streaming using Y nodes'. Any information at all will be welcome. I'll feed back a summary and/or update a wiki page once I collate the information. Cheers, - Steve -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Execute HiveFromSpark ERROR.
As suggested in the error messages, double-check your class path. From: CharlieLin chury...@gmail.com Date: Tuesday, August 26, 2014 at 8:29 PM To: user@spark.apache.org Subject: Execute HiveFromSpark ERROR. hi, all: I tried to use Spark SQL on spark-shell, as in the spark example. When I execute: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") it reports an error like below: scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") 14/08/27 11:08:19 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 14/08/27 11:08:19 INFO ParseDriver: Parse Completed 14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences 14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch Check Analysis 14/08/27 11:08:19 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange 14/08/27 11:08:19 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions 14/08/27 11:08:19 INFO Driver: <PERFLOG method=Driver.run> 14/08/27 11:08:19 INFO Driver: <PERFLOG method=TimeToSubmit> 14/08/27 11:08:19 INFO Driver: <PERFLOG method=compile> 14/08/27 11:08:19 INFO Driver: <PERFLOG method=parse> 14/08/27 11:08:19 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 14/08/27 11:08:19 INFO ParseDriver: Parse Completed 14/08/27 11:08:19 INFO Driver: </PERFLOG method=parse start=1409108899822 end=1409108899822 duration=0> 14/08/27 11:08:19 INFO Driver: <PERFLOG method=semanticAnalyze> 14/08/27 11:08:19 INFO SemanticAnalyzer: Starting Semantic Analysis 14/08/27 11:08:19 INFO SemanticAnalyzer: Creating table src position=27 14/08/27 11:08:19 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/08/27 11:08:19 INFO ObjectStore: ObjectStore, initialize called 14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL file:/home/spark/spark/lib_managed/jars/datanucleus-core-3.2.2.jar is already registered, and you are trying to register an identical plugin located at URL file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-core-3.2.2.jar. 14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus.store.rdbms is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-rdbms-3.2.1.jar is already registered, and you are trying to register an identical plugin located at URL file:/home/spark/spark/lib_managed/jars/datanucleus-rdbms-3.2.1.jar. 14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus.api.jdo is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath.
The URL file:/home/spark/spark/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar is already registered, and you are trying to register an identical plugin located at URL file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar. 14/08/27 11:08:20 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189) at
Re: What is a Block Manager?
I have long-lived state I'd like to maintain on the executors that I'd like to initialize during some bootstrap phase, and to update the master when such an executor leaves the cluster. On Tue, Aug 26, 2014 at 11:18 PM, Liu, Raymond raymond@intel.com wrote: The framework has this info to manage cluster status, and this info (e.g. worker number) is also available through the spark metrics system. But from the user application's point of view, can you give an example of why you need this info, and what you would plan to do with it? Best Regards, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Wednesday, August 27, 2014 1:40 PM To: Liu, Raymond Cc: user@spark.apache.org Subject: Re: What is a Block Manager? We're a single-app deployment so we want to launch as many executors as the system has workers. We accomplish this by not configuring the max for the application. However, is there really no way to inspect what machines/executor ids/number of workers/etc is available in context? I'd imagine that there'd be something in the SparkContext or in the listener, but all I see in the listener is block managers getting added and removed. Wouldn't one care about the workers getting added and removed at least as much as the block managers? On Tue, Aug 26, 2014 at 6:58 PM, Liu, Raymond raymond@intel.com wrote: Basically, a Block Manager manages the storage for most of the data in spark, to name a few: blocks that represent a cached RDD partition, intermediate shuffle data, broadcast data, etc. It is per executor, while in standalone mode, normally, you have one executor per worker. You don't control how many workers you have at runtime, but you can somehow manage how many executors your application will launch. Check each running mode's documentation for details (but control where? Hardly; yarn mode does some work based on data locality, but this is done by the framework, not the user program). Best Regards, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Tuesday, August 26, 2014 11:42 PM To: user@spark.apache.org Subject: What is a Block Manager? I'm curious not only about what they do, but what their relationship is to the rest of the system. I find that I get listener events for n block managers added, where n is also the number of workers I have available to the application. Is this a stable constant? Also, are there ways to determine at runtime how many workers I have and where they are? Thanks, Victor
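A minimal sketch of the listener route discussed above, assuming a 1.0.x-era SparkListener, where block manager add/remove events are the signal you get (richer executor-level events came in later releases); since there is roughly one block manager per executor, this approximates tracking executors:

    import org.apache.spark.scheduler._
    sc.addSparkListener(new SparkListener {
      override def onBlockManagerAdded(added: SparkListenerBlockManagerAdded) =
        println("block manager up on " + added.blockManagerId.hostPort)  // ~one per executor
      override def onBlockManagerRemoved(removed: SparkListenerBlockManagerRemoved) =
        println("block manager gone: " + removed.blockManagerId)
    })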
RE: Save an RDD to a SQL Database
I have a similar requirement to export data to mysql. Just wanted to know what the best approach is so far, after the research you guys have done. Currently thinking of saving to hdfs and using sqoop to handle the export. Is that the best approach, or is there any other way to write to mysql? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p12921.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
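One hedged alternative to the HDFS+Sqoop route is writing straight from the workers with foreachPartition; the table name, URL and credentials below are invented for illustration:

    // sketch: one connection per partition; assumes the mysql jdbc driver jar
    // is on the executor classpath
    import java.sql.DriverManager
    rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "pass")
      val ps = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
      rows.foreach { r => ps.setString(1, r); ps.executeUpdate() }
      conn.close()
    }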
Re: CUDA in spark, especially in MLlib?
Thank you all. Actually I was looking at JCUDA. Function-wise this may be a perfect solution to offload computation to the GPU. We'll see how the performance is, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Chen He airb...@gmail.com To: Antonio Jesus Navarro ajnava...@stratio.com, Cc: Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date: 08/27/2014 11:03 AM Subject: Re: CUDA in spark, especially in MLlib? JCUDA can let you do that in Java http://www.jcuda.org On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi, I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei
Re: Out of memory on large RDDs
I have the same issue (I'm using the latest 1.1.0-SNAPSHOT). I've increased my driver memory to 30G, executor memory to 10G, and spark.akka.askTimeout to 180. Still no good. My other configurations are: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.mb 256 spark.shuffle.consolidateFiles true spark.shuffle.file.buffer.kb 400 spark.akka.frameSize 500 spark.akka.timeout 600 spark.akka.askTimeout 180 spark.core.connection.auth.wait.timeout 300 However I just got informed that the YARN cluster I'm using throttles the resources for the default queue. Not sure if it's related. Jianshi On Wed, Aug 27, 2014 at 5:15 AM, Andrew Ash and...@andrewash.com wrote: Hi Grega, Did you ever get this figured out? I'm observing the same issue in Spark 1.0.2. For me it was after 1.5hr of a large .distinct call, followed by a .saveAsTextFile() 14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18500 14/08/26 20:57:43 INFO executor.Executor: Running task ID 18500 14/08/26 20:57:43 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/26 20:57:43 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them 14/08/26 20:58:13 ERROR executor.Executor: Exception in task ID 18491 org.apache.spark.SparkException: Error communicating with MapOutputTracker at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:108) at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:155) at org.apache.spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:42) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:105) ...
23 more On Tue, Mar 11, 2014 at 3:07 PM, Grega Kespret gr...@celtra.com wrote: Your input data read as RDD may be causing OOM, so that's where you can use a different memory configuration. We are not getting any OOM exceptions, just akka future timeouts in mapoutputtracker and unsuccessful gets of shuffle outputs, therefore refetching them. What is the industry practice when going about debugging such errors? Questions: - why are mapoutputtrackers timing out? (and how to debug this properly?) - what is the task/purpose of the mapoutputtracker? - how to check per-task object sizes? Thanks, Grega On 11 Mar 2014, at 18:43, Mayur Rustagi mayur.rust...@gmail.com wrote: Shuffle data is always stored on disk, so it's unlikely to cause OOM. Your input data read as RDD may be causing OOM, so that's where you can use a different memory configuration. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Mar 11, 2014 at 9:20 AM, sparrow do...@celtra.com wrote: I don't understand how exactly that will help.
Re: Specifying classpath
I solved this issue by putting the hbase-protocol jar on the Hadoop classpath, and not on the spark classpath. export HADOOP_CLASSPATH=/path/to/jar/hbase-protocol-0.98.1-cdh5.1.0.jar On Tue, Aug 26, 2014 at 5:42 PM, Ashish Jain ashish@gmail.com wrote: Hello, I'm using the following version of Spark - 1.0.0+cdh5.1.0+41 (1.cdh5.1.0.p0.27). I've tried to specify the libraries Spark uses in the following ways - 1) Adding it to the spark context 2) Specifying the jar path in a) spark.executor.extraClassPath b) spark.executor.extraLibraryPath 3) Copying the libraries to spark/lib 4) Specifying the path in SPARK_CLASSPATH, even SPARK_LIBRARY_PATH 5) Passing as a --jars argument but the spark application is not able to pick up the libraries, although I can see the message SparkContext: Added JAR file. I get a NoClassDefFound error. The only way I've been able to make it work right now is by merging my application jar with all the library jars. What might be going on? I need it right now to specify hbase-protocol-0.98.1-cdh5.1.0.jar in SPARK_CLASSPATH as mentioned here https://issues.apache.org/jira/browse/HBASE-10877. I'm using spark-submit to submit the job Thanks Ashish
Re: Issue Connecting to HBase in spark shell
It looks like the issue I had is that I didn't pull the htrace-core jar into the spark classpath. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-Connecting-to-HBase-in-spark-shell-tp12855p12924.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
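For anyone hitting the same thing, the fix sketched as a command (the jar path and version number are assumptions, not taken from this thread; use whatever htrace-core your HBase ships with):

    bin/spark-shell --driver-class-path /path/to/htrace-core-2.04.jar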
Re: CUDA in spark, especially in MLlib?
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function-wise this may be a perfect solution to offload computation to the GPU. We'll see how the performance is, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Chen He airb...@gmail.com To: Antonio Jesus Navarro ajnava...@stratio.com, Cc: Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date: 08/27/2014 11:03 AM Subject: Re: CUDA in spark, especially in MLlib? JCUDA can let you do that in Java http://www.jcuda.org On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi, I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark 1.1 doesn't work with hive context
It is my mistake; somehow I had added the io.compression.codecs property with the above mentioned class as its value. Resolved the problem now. Thanks and Regards, Sankar S. On Wednesday, 27 August 2014, 1:23, S Malligarjunan smalligarju...@yahoo.com wrote: Hello all, I have just checked out branch-1.1 and executed the below commands ./bin/spark-shell --driver-memory 1G val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL hiveContext.hql("FROM src SELECT key, value").collect().foreach(println) I am getting the following exception Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135) at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175) at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45) ... 72 more Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found Thanks and Regards, Sankar S.
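For context, that property lives in core-site.xml; a hedged illustration of the misconfiguration (the surrounding codec values are invented), where either removing the LzoCodec entry or putting the hadoop-lzo jar on the classpath clears the error:

    <!-- illustrative core-site.xml fragment; the stray LzoCodec entry is what
         triggers the ClassNotFoundException above when the jar is absent -->
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec</value>
    </property>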
Re: Low Level Kafka Consumer for Spark
Hi Dibyendu, That would be great. One of the biggest drawbacks of the Kafka utils, as well as your implementation, is that I am unable to scale out processing. I am relatively new to Spark and Spark Streaming - from what I read and what I observe with my deployment, an RDD created on one receiver is processed by at most 2 nodes in my cluster (most likely because the default replication is 2 and spark schedules processing close to where the data is). I tried rdd.replicate() to no avail. Would Chris's and your proposal to have a union of DStreams for all these Receivers still allow scaling out subsequent processing? Thanks, Bharat On Tue, Aug 26, 2014 at 10:59 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Thanks Chris and Bharat for your inputs. I agree, running multiple receivers/dstreams is desirable for scalability and fault tolerance, and this is easily doable. In the present KafkaReceiver I am creating as many threads as there are kafka topic partitions, but I can definitely create multiple KafkaReceivers, one for every partition. As Chris mentioned, in this case I need to then have a union of DStreams for all these Receivers. I will try this out and let you know. Dib On Wed, Aug 27, 2014 at 9:10 AM, Chris Fregly ch...@fregly.com wrote: great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit: what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here: https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123 for more context, the kinesis implementation above uses the Kinesis Client Library (KCL) to automatically assign - and load balance - stream shards among all KCL threads from all receivers (potentially coming and going as nodes die) on all executors/nodes using DynamoDB as the association data store. ZooKeeper would be used for your Kafka consumers, of course. and ZooKeeper watches to handle the ephemeral nodes. and I see you're using Curator, which makes things easier. as bharat suggested, running multiple receivers/dstreams may be desirable from a scalability and fault tolerance standpoint. is this type of load balancing possible among your different Kafka consumers running in different ephemeral JVMs? and isn't it fun proposing a popular piece of code? the question floodgates have opened! haha. :) -chris On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Bharat, Thanks for your email. If the Kafka Reader worker process dies, it will be replaced by a different machine, and it will start consuming from the offset where it left off (for each partition). The same thing happens even if I try to have an individual Receiver for every partition. Regards, Dibyendu On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat bvenkat.sp...@gmail.com wrote: I like this consumer for what it promises - better control over offset and recovery from failures. If I understand this right, it still uses a single worker process to read from Kafka (one thread per partition) - is there a way to specify multiple worker processes (on different machines) to read from Kafka? Maybe one worker process for each partition? If there is no such option, what happens when the single machine hosting the Kafka Reader worker process dies and is replaced by a different machine (like in the cloud)?
Thanks, Bharat -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
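The union-of-receivers pattern under discussion, sketched with the stock KafkaUtils API (the ZooKeeper quorum, group id, topic name and receiver count are all placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils
    // one receiver per partition, then union the streams so subsequent
    // processing can spread across the cluster
    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)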
Amplab: big-data-benchmark
Hi All, I am planning to run the amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to download these datasets directly. Here is what I see. I read that they can be used directly by calling sc.textFile on the s3n:// path. However, I wanted to make sure that my understanding is correct. Here is what I see at http://s3.amazonaws.com/big-data-benchmark/ I do not see anything for sequence or text-deflate. I see a sequence-snappy dataset: <Contents><Key>pavlo/sequence-snappy/5nodes/crawl/000738_0</Key><LastModified>2013-05-27T21:26:40.000Z</LastModified><ETag>a978d18721d5a533d38a88f558461644</ETag><Size>42958735</Size><StorageClass>STANDARD</StorageClass></Contents> For text, I get the following error: <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>pavlo/text/1node/crawl</Key><RequestId>166D239D38399526</RequestId><HostId>4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI</HostId></Error> Please let me know if there is a way to readily download the dataset and view it.
Re: Amplab: big-data-benchmark
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings, in lower case, after the format and size you want them in. They should be there. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org Sent: Wednesday, August 27, 2014 11:42:28 AM Subject: Amplab: big-data-benchmark Hi All, I am planning to run the amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to download these datasets directly. Here is what I see. I read that they can be used directly by calling sc.textFile on the s3n:// path. However, I wanted to make sure that my understanding is correct. Here is what I see at http://s3.amazonaws.com/big-data-benchmark/ I do not see anything for sequence or text-deflate. I see a sequence-snappy dataset: <Contents><Key>pavlo/sequence-snappy/5nodes/crawl/000738_0</Key><LastModified>2013-05-27T21:26:40.000Z</LastModified><ETag>a978d18721d5a533d38a88f558461644</ETag><Size>42958735</Size><StorageClass>STANDARD</StorageClass></Contents> For text, I get the following error: <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>pavlo/text/1node/crawl</Key><RequestId>166D239D38399526</RequestId><HostId>4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI</HostId></Error> Please let me know if there is a way to readily download the dataset and view it. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext support Parquet?
Thanks a lot. Finally, I can create the parquet table using your command --driver-class-path. I am using hadoop 2.3. Now, I will try to load data into the tables. Thanks, lyc -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12931.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Amplab: big-data-benchmark
Hi Burak, Thanks, I will then start benchmarking the cluster. Date: Wed, 27 Aug 2014 11:52:05 -0700 From: bya...@stanford.edu To: ssti...@live.com CC: user@spark.apache.org Subject: Re: Amplab: big-data-benchmark Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings, in lower case, after the format and size you want them in. They should be there. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org Sent: Wednesday, August 27, 2014 11:42:28 AM Subject: Amplab: big-data-benchmark Hi All, I am planning to run the amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to download these datasets directly. Here is what I see. I read that they can be used directly by calling sc.textFile on the s3n:// path. However, I wanted to make sure that my understanding is correct. Here is what I see at http://s3.amazonaws.com/big-data-benchmark/ I do not see anything for sequence or text-deflate. I see a sequence-snappy dataset: <Contents><Key>pavlo/sequence-snappy/5nodes/crawl/000738_0</Key><LastModified>2013-05-27T21:26:40.000Z</LastModified><ETag>a978d18721d5a533d38a88f558461644</ETag><Size>42958735</Size><StorageClass>STANDARD</StorageClass></Contents> For text, I get the following error: <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>pavlo/text/1node/crawl</Key><RequestId>166D239D38399526</RequestId><HostId>4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI</HostId></Error> Please let me know if there is a way to readily download the dataset and view it. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext support Parquet?
I'll note the parquet jars are included by default in 1.1 On Wed, Aug 27, 2014 at 11:53 AM, lyc yanchen@huawei.com wrote: Thanks a lot. Finally, I can create the parquet table using your command --driver-class-path. I am using hadoop 2.3. Now, I will try to load data into the tables. Thanks, lyc -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12931.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to get prerelease thriftserver working?
I would expect that to work. What exactly is the error? On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote: (apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand (and run) the Thrift-based JDBC server, in order to enable remote JDBC access to our dev cluster. Before asking about details, is my understanding of this correct? `sbin/start-thriftserver` is a JDBC/Hive server that doesn't require running a Hive+MR cluster (i.e. just Spark/Spark+YARN)? Assuming yes, I have hope that it all basically works, just that some documentation needs to be cleaned up: - I found a release page implying that 1.1 will be released pretty soon-ish: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - I can find recent (within the last 30 days or so) activity with promising titles: [Updated Spark SQL README to include the hive-thriftserver module](https://github.com/apache/spark/pull/1867), [[SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)]( https://github.com/apache/spark/pull/1620) Am I following all the right email threads, issue trackers, and whatnot? Specifically, I tried: 1. Building off of `branch-1.1`, synced as of ~today (2014 Aug 25) 2. Running `sbin/start-thriftserver.sh` in `yarn-client` mode 3. I can see the process running, and the spark context/app created in the yarn logs, and can connect to the thrift server on the default port of 10000 using `bin/beeline` 4. However, when I try to find out what that cluster has via `show tables;`, in the logs I see a connection error to some (what I assume to be) random port. So what service am I forgetting/too ignorant to run? Or did I misunderstand and we do need a live Hive instance to back the thriftserver? Or is this a YARN-specific issue? Only recently started learning the ecosystem and community, so apologies for the longer post and lots of questions. :) Matt
RE: Execution time increasing with increase of cluster size
Can you tell which nodes were doing the computation in each case? Date: Wed, 27 Aug 2014 20:29:38 +0530 Subject: Execution time increasing with increase of cluster size From: sarathchandra.jos...@algofusiontech.com To: user@spark.apache.org Hi, I've written a simple scala program which reads a file on HDFS (a delimited file having 100 fields and 1 million rows), splits each row on the delimiter, computes the hashcode of each field, makes new rows with these hashcodes and writes those rows back to HDFS. Code attached. When I run this on a spark cluster of 2 nodes (these 2 nodes also act as the HDFS cluster) it took about 35sec to complete. Then I increased the cluster to 4 nodes (the additional nodes are not part of the HDFS cluster) and submitted the same job. I was expecting a decrease in the execution time, but instead it took 3 times longer (1.6 min) to complete. Attached snapshots of the execution summary. Both times I've set executor memory to 6GB, which is available on all the nodes. What am I missing here? Do I need to do any additional configuration when increasing the cluster size? ~Sarath - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: hive on spark yarn
You need to have the DataNucleus jars on your classpath. It is not okay to merge them into an uber jar. On Wed, Aug 27, 2014 at 1:44 AM, centerqi hu cente...@gmail.com wrote: Hi all, When I run a simple SQL statement I encounter the following error. hive: 0.12 (metastore in mysql), hadoop 2.4.1, spark 1.0.2 built with hive. My HQL code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql._
    import org.apache.spark.sql.hive.LocalHiveContext

    object HqlTest {
      case class Record(key: Int, value: String)

      def main(args: Array[String]) {
        val sparkConf = new SparkConf().setAppName("HiveFromSpark")
        val sc = new SparkContext(sparkConf)
        val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
        import hiveContext._
        hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
      }
    }

14/08/27 16:07:08 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/08/27 16:07:08 INFO ObjectStore: ObjectStore, initialize called org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:163) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:250) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78) at HqlTest$.main(HqlTest.scala:15) at HqlTest.main(HqlTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) ... 26 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210) ... 31 more Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. NestedThrowables: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304) at
Spark N.C.
Looking for fellow Spark enthusiasts based in and around Research Triangle Park, Raleigh, Durham, and Chapel Hill, North Carolina. Please get in touch off-list about an employment opportunity. Must be local. Thanks! -Andrew
MLBase status
Hi All, I was wondering if someone could please tell me the status of MLbase and its roadmap in terms of software releases. We are very interested in exploring it for our applications.
[Streaming] Cannot get executors to stay alive
Hi, I asked a similar question before and didn't get any answers, so I'll try again: I am using updateStateByKey, pretty much exactly as shown in the examples shipping with Spark:

    def createContext(master: String, dropDir: String, checkpointDirectory: String) = {
      val updateFunc = (values: Seq[Int], state: Option[Int]) => {
        val currentCount = values.sum
        val previousCount = state.getOrElse(0)
        Some(currentCount + previousCount)
      }
      val sparkConf = new SparkConf()
        .setMaster(master)
        .setAppName("StatefulNetworkWordCountNoMem")
        .setJars(StreamingContext.jarOfClass(this.getClass))
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint(checkpointDirectory)
      val lines = ssc.textFileStream(dropDir)
      val wordDstream = lines.map { line =>
        val x = line.split("\t")
        ((x(0), x(1), x(2)), 1)
      }
      wordDstream.print()
      val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
      println("Printing stream")
      stateDstream.print()
      ssc
    }

I am running this over a _static_ drop directory (i.e. files are there at the beginning of time, nothing new is coming). The only difference is that I have a pretty wide key space -- about 440K keys, each of length under 20 chars. The code exactly as shown above runs for a while (about an hour) and then the executors start dying with OOM exceptions. I tried executor memory from 512M to 2g (4 executors) -- the only difference is how long it takes for the OOM. The only task that keeps executing is take at DStream.scala:586, which makes sense. What doesn't make sense is why the memory isn't getting reclaimed. What am I doing wrong? I'd like to use streaming, but it seems that I can't get a process with 0 incoming traffic to stay up. Any advice much appreciated -- I'm sure someone on this list has managed to run a streaming program for longer than an hour!
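One thing worth checking with updateStateByKey: state for a key is kept until the update function returns None, and the function is invoked for every known key on every batch, even when no new values arrive. A sketch of bounding the state by dropping idle keys -- the (count, lastSeen) layout and the one-hour timeout are illustrative, not from the original code:

    // State = (running count, last time the key received data).
    val timeoutMs = 60 * 60 * 1000L
    val expiringUpdateFunc = (values: Seq[Int], state: Option[(Int, Long)]) => {
      val now = System.currentTimeMillis()
      val (count, lastSeen) = state.getOrElse((0, now))
      if (values.isEmpty) {
        // Returning None lets Spark drop the state for idle keys.
        if (now - lastSeen > timeoutMs) None else Some((count, lastSeen))
      } else {
        Some((count + values.sum, now))
      }
    }
    val stateDstream = wordDstream.updateStateByKey[(Int, Long)](expiringUpdateFunc)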
Re: disable log4j for spark-shell
You just have to tell Spark which log4j properties file to use. I think --driver-java-options=-Dlog4j.configuration=log4j.properties should work, but it didn't for me. set SPARK_JAVA_OPTS=-Dlog4j.configuration=log4j.properties did work though (this was on Windows, in local mode, assuming you put a file called log4j.properties in the bin directory where you run the shell from). In either case, like Sean said, -Dlog4j.configuration=... is the magic incantation; you just have to figure out how to pass it depending on what version of Spark you're using. (I personally find setting SPARK_PRINT_LAUNCH_COMMAND=1 very, very useful when I'm trying to track a property.) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p12942.html
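For reference, a minimal log4j.properties along the lines of Spark's bundled conf/log4j.properties.template, quieting the shell down to warnings (a sketch; adjust the level to taste):

    # Log everything at WARN and above to the console.
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n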
Historic data and clocks
Hi, In an attempt to keep processing logic as simple as possible, I'm trying to use Spark Streaming for processing historic as well as real-time data. This works quite well, using big intervals that match the window size for historic data, and small intervals for real-time. I found this discussion on a (very) similar situation: https://groups.google.com/forum/#!topic/spark-users/ES8X1l_xn5s To be able to do this historic data processing, I need access to the ManualClock that's used by the job generator. That's unfortunately only accessible with reflection. Since I couldn't find JIRA or GitHub issues, I wonder if there's perhaps a better alternative solution, or if it would be worthwhile improving support for manual time management? cheers, Frank
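For the record, a sketch of the reflection involved; the property, field, and method names here are Spark 1.x streaming internals and may differ between versions, so treat this as illustrative only:

    // Must be set before the StreamingContext is created.
    System.setProperty("spark.streaming.clock",
      "org.apache.spark.streaming.util.ManualClock")

    // Dig the ManualClock out of the JobGenerator via reflection.
    def field(obj: AnyRef, name: String): AnyRef = {
      val f = obj.getClass.getDeclaredField(name)
      f.setAccessible(true)
      f.get(obj)
    }
    val clock = field(field(field(ssc, "scheduler"), "jobGenerator"), "clock")
    // Advance time by one 10-second batch.
    clock.getClass.getMethod("addToTime", classOf[Long])
      .invoke(clock, java.lang.Long.valueOf(10000L))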
Re: CUDA in spark, especially in MLlib?
You could try looking at ScalaCL [1]; it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function-wise this may be a perfect solution to offload computation to the GPU. We'll see how the performance will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Chen He airb...@gmail.com To: Antonio Jesus Navarro ajnava...@stratio.com, Cc: Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date: 08/27/2014 11:03 AM Subject: Re: CUDA in spark, especially in MLlib? -- JCUDA can let you do that in Java: http://www.jcuda.org/ On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi, I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei
SchemaRDD
I feel like SchemaRDD has uses beyond just SQL. Perhaps it belongs in core?
Spark Streaming: DStream - zipWithIndex
Hello, If I do: dstream.transform { rdd => rdd.zipWithIndex.map { ... } } is the index guaranteed to be unique across all RDDs here? Thanks, -Soumitra.
Re: Spark Streaming: DStream - zipWithIndex
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: dstream.transform { rdd => rdd.zipWithIndex.map { ... } } is the index guaranteed to be unique across all RDDs here? Thanks, -Soumitra.
minPartitions ignored for bz2?
Hi, I'm running on the master branch and I noticed that textFile ignores minPartitions for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1, whereas doing it for a non-bz2 file worked. Not sure if this matters or not, but I'm reading from S3. - jerry -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/minPartitions-ignored-for-bz2-tp12948.html
Re: Spark Streaming: DStream - zipWithIndex
So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui
Re: Spark Streaming: DStream - zipWithIndex
You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index?
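To make the hack concrete, a minimal sketch inside a transform (Spark 1.x; assumes a DStream[String] and, as above, fewer than 1 billion records per RDD):

    import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
    import org.apache.spark.streaming.dstream.DStream

    def withGlobalIds(dstream: DStream[String]): DStream[(String, Long)] =
      dstream.transform { rdd =>
        // rdd.id is unique within the SparkContext; uid is unique within the RDD.
        rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)
      }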
Re: minPartitions ignored for bz2?
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files before 1.2 (or a later version). But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote: Hi, I'm running on the master branch and I noticed that textFile ignores minPartitions for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1, whereas doing it for a non-bz2 file worked. Not sure if this matters or not, but I'm reading from S3. - jerry -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/minPartitions-ignored-for-bz2-tp12948.html
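Until splittable bz2 reads are available, one workaround is to repartition right after loading (a sketch; the bucket and partition count are hypothetical, and the repartition does incur a shuffle):

    // A non-splittable bz2 file arrives as a single partition;
    // spread it across the cluster explicitly.
    val rdd = sc.textFile("s3n://my-bucket/data.bz2").repartition(16)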
Re: SchemaRDD
I think this will increasingly be its role, though it doesn't make sense to move it to core because it is clearly just a client of the core APIs. What usage do you have in mind in particular? It would be nice to know how the non-SQL APIs for this could be better. Matei On August 27, 2014 at 2:31:45 PM, Koert Kuipers (ko...@tresata.com) wrote: I feel like SchemaRDD has uses beyond just SQL. Perhaps it belongs in core?
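For context, even in 1.0.x a SchemaRDD can already be used both as a plain RDD and through relational operators without writing any SQL text (a sketch; the names and data are illustrative):

    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    val sqlContext = new SQLContext(sc)
    import sqlContext._ // brings in createSchemaRDD and the query DSL

    case class Person(name: String, age: Int)
    val people: SchemaRDD =
      sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 17)))

    // Ordinary RDD operations on the underlying Rows...
    people.map(row => row.getString(0)).collect()
    // ...and relational operators, no SQL string needed:
    people.where('age >= 18).select('name).collect()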
Re: Spark Streaming: DStream - zipWithIndex
Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack ..
Re: How to get prerelease thriftserver working?
Hey Matt, if you want to access existing Hive data, you still need to run a Hive metastore service and provide a proper hive-site.xml (just drop it in $SPARK_HOME/conf). Could you provide the error log you saw? On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust mich...@databricks.com wrote: I would expect that to work. What exactly is the error? On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote: I know it's not officially released yet, but I'm trying to understand (and run) the Thrift-based JDBC server, in order to enable remote JDBC access to our dev cluster.
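For reference, a minimal hive-site.xml along those lines (a sketch; the metastore host and port are hypothetical, and hive.metastore.uris must point at your running metastore service):

    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://your-metastore-host:9083</value>
      </property>
    </configuration>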
Re: Spark Streaming: DStream - zipWithIndex
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks. Just to double check, rdd.id would be unique for a batch in a DStream?
FileNotFoundException (No space left on device) writing to S3
Hello, I've been seeing the following errors when trying to save to S3:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task 4058.3 in stage 2.1
    (TID 12572, ip-10-81-151-40.ec2.internal): java.io.FileNotFoundException:
    /mnt/spark/spark-local-20140827191008-05ae/0c/shuffle_1_7570_5768 (No space left on device)
    java.io.FileOutputStream.open(Native Method)
    java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

df tells me there is plenty of space left on the worker node:

    [root@ip-10-81-151-40 ~]$ df -h
    Filesystem      Size  Used  Avail  Use%  Mounted on
    /dev/xvda1      7.9G  4.6G   3.3G   59%  /
    tmpfs           7.4G     0   7.4G    0%  /dev/shm
    /dev/xvdb        37G   11G    25G   30%  /mnt
    /dev/xvdf        37G  9.5G    26G   27%  /mnt2

Any suggestions? Dan
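A note for anyone hitting this: shuffle files land under spark.local.dir (or SPARK_LOCAL_DIRS with the EC2 scripts), and "No space left on device" can also mean the volume ran out of inodes, so df -i is worth checking alongside df -h. A sketch of spreading local/shuffle space across both large mounts (paths are illustrative):

    import org.apache.spark.SparkConf

    // Use both large volumes for shuffle and spill files.
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")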
Re: Spark Streaming: DStream - zipWithIndex
I see an issue here. If rdd.id is 1000, then rdd.id * 1e9.toLong would be BIG. I wish there was a DStream.mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack ..
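If the multiplication feels fragile, an alternative sketch is to pack the RDD id into the high bits of the Long instead (assumes per-RDD unique ids below 2^33 and RDD ids below 2^30; names are illustrative):

    import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
    import org.apache.spark.streaming.dstream.DStream

    def withPackedIds(dstream: DStream[String]): DStream[(String, Long)] =
      dstream.transform { rdd =>
        // High bits: context-unique RDD id; low 33 bits: per-RDD unique id.
        rdd.zipWithUniqueId().mapValues(uid => (rdd.id.toLong << 33) | uid)
      }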
SparkSQL returns ArrayBuffer for fields of type Array
Hi, Michael. I used HiveContext to create a table with a field of type Array. However, in the hql results, this field was returned as type ArrayBuffer, which is mutable. Would it make more sense for it to be an Array? The Spark version of my test is 1.0.2. I haven't tested it on SQLContext or a newer version of Spark yet. Thanks, Du
Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra
I'm not so sure that your error is coming from the Cassandra write. You have val data = test.map(..).map(..), so data will actually not get created until you try to save it. Can you try to do something like data.count() or data.take(k) after this line and see if you even get to the Cassandra part? My suspicion is that you're trying to access something (SparkConf?) within the map closures... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Cassandra-NotSerializable-Exception-while-saving-to-cassandra-tp12906p12960.html
Re: SparkSQL returns ArrayBuffer for fields of type Array
Arrays in the JVM are also mutable. However, you should not be relying on the exact type here. The only promise is that you will get back something of type Seq[_]. On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote: Hi, Michael. I used HiveContext to create a table with a field of type Array. However, in the hql results, this field was returned as type ArrayBuffer which is mutable. Would it make more sense to be an Array? The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor newer version of Spark yet. Thanks, Du
Re: SparkSQL returns ArrayBuffer for fields of type Array
I found this discrepancy when writing unit tests for my project. Basically the expectation was that the returned type should match that of the input data. Although it's easy to work around, I was just feeling a bit weird. Is there a better reason to return ArrayBuffer? From: Michael Armbrust mich...@databricks.com Date: Wednesday, August 27, 2014 at 5:21 PM To: Du Li l...@yahoo-inc.com Cc: user@spark.apache.org Subject: Re: SparkSQL returns ArrayBuffer for fields of type Array Arrays in the JVM are also mutable. However, you should not be relying on the exact type here. The only promise is that you will get back something of type Seq[_].
Kafka stream receiver stops input
Hi, I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1. I have a streaming job that reads from a Kafka topic and writes output to another Kafka topic. The job starts fine, but after a while the input stream stops getting any data. I think these messages show no incoming data on the stream: 14/08/28 00:42:15 INFO ReceiverTracker: Stream 0 received 0 blocks I run the job as: spark-submit --class logStreamNormalizer --master yarn log-stream-normalizer_2.10-1.0.jar --jars spark-streaming-kafka_2.10-1.0.2.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar --executor-memory 6G --spark.cleaner.ttl 60 --executor-cores 4 As soon as I start the job, I see an error like: 14/08/28 00:50:59 INFO BlockManagerInfo: Added input-0-1409187056800 in memory on node6-acme.com:39418 (size: 83.3 MB, free: 3.1 GB) Exception in thread pool-1-thread-7 java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:85) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:42) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) But I am not sure if that is the cause, because even after that OOM message I see data coming in: 14/08/28 00:51:00 INFO ReceiverTracker: Stream 0 received 6 blocks Appreciate any pointers or suggestions to troubleshoot the issue. Thanks
Re: SparkSQL returns ArrayBuffer for fields of type Array
In general the various language interfaces try to return the natural type for the language. In Python we return lists; in Scala we return Seqs. Arrays on the JVM have all sorts of messy semantics (e.g. they are invariant and don't have erasure). On Wed, Aug 27, 2014 at 5:34 PM, Du Li l...@yahoo-inc.com wrote: I found this discrepancy when writing unit tests for my project. Basically the expectation was that the returned type should match that of the input data. Although it's easy to work around, I was just feeling a bit weird. Is there a better reason to return ArrayBuffer?
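Following that advice, a test-code sketch that asserts against the Seq interface rather than a concrete collection type (Spark 1.0.x Row API; the table and expected values are hypothetical):

    val row = hiveContext.hql("SELECT tags FROM src_with_array").collect().head
    // Whether the value is backed by an ArrayBuffer is an implementation
    // detail; only the Seq contract is promised.
    val tags = row(0).asInstanceOf[Seq[String]]
    assert(tags == Seq("a", "b", "c"))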
Re: MLBase status
Hi Sameer, MLbase started out as a set of three ML components on top of Spark. The lowest level, MLlib, is now a rapidly growing component within Spark and is maintained by the Spark community. The two higher-level components (MLI and MLOpt) are experimental components that serve as testbeds for MLlib. A prototype of MLI was released last fall, and since then many ideas from this prototype and the associated paper http://www.cs.berkeley.edu/~ameet/mlinterface_icdm.pdf have been (or are in the process of being) ported to MLlib. MLOpt, which ultimately aims to automate the task of ML pipeline construction, is an active area of research in the AMPLab. We are still exploring various research ideas, and given the nature of research, it's not clear when we will be ready to release any code. For more details check out mlbase.org, my talk https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf at the Spark Summit about MLlib (slides 5, 6, 33, 34 describe MLbase), and Evan's slides http://spark-summit.org/wp-content/uploads/2014/07/Ghostface-Towards-an-Optimizer-for-MLBase-Sparks-Talkwalkar-Franklin-Jordan-Kraska1.pdf on the current status of MLOpt. -Ameet On Wed, Aug 27, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I was wondering can someone please tell me the status of MLbase and its roadmap in terms of software release. We are very interested in exploring it for our applications.
Re: Low Level Kafka Consumer for Spark
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote: Dibyendu, Thanks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being computed again, and that's not happening after further tests. This applies to Kafka as well. The issue is of major priority fortunately. Regarding your suggestion, I would maybe prefer to have the problem resolved within Spark's internals, since once the data is replicated we should be able to access it once more and not have to pull it back again from Kafka or any other stream that is affected by this issue. If for example there is a big amount of batches to be recomputed, I would rather have them done distributed than overloading the batch interval with a huge amount of Kafka messages. I do not yet have enough know-how on where the issue is or about the internal Spark code, so I can't really say how difficult the implementation will be. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12966.html
Compilation Error: Spark 1.0.2 with HBase 0.98
Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98. My steps:

    wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
    tar -vxf spark-1.0.2.tgz
    cd spark-1.0.2

edit project/SparkBuild.scala, set HBASE_VERSION:

    // HBase version; set as appropriate.
    val HBASE_VERSION = "0.98.2"

edit pom.xml with the following values:

    <hadoop.version>2.4.1</hadoop.version>
    <protobuf.version>2.5.0</protobuf.version>
    <yarn.version>${hadoop.version}</yarn.version>
    <hbase.version>0.98.5</hbase.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <hive.version>0.13.1</hive.version>

then build:

    SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly

but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2. Can you please advise how to compile Spark 1.0.2 with HBase 0.98? Or should I set HBASE_VERSION back to "0.94.6"? Regards Arthur [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: org.apache.hbase#hbase;0.98.2: not found [warn] :: sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) at sbt.IvySbt$$anon$3.call(Ivy.scala:60) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) at xsbt.boot.Using$.withResource(Using.scala:11) at xsbt.boot.Using$.apply(Using.scala:10) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:60) at sbt.IvySbt.withIvy(Ivy.scala:101) at sbt.IvySbt.withIvy(Ivy.scala:97) at sbt.IvySbt$Module.withModule(Ivy.scala:116) at sbt.IvyActions$.update(IvyActions.scala:125) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42) at sbt.std.Transform$$anon$4.work(System.scala:64) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18) at sbt.Execute.work(Execute.scala:244) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160) at sbt.CompletionService$$anon$2.call(CompletionService.scala:30) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) [error] (examples/*:update) sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found [error]
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
See SPARK-1297. The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please ignore if duplicated) Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2. Can you please advise how to compile Spark 1.0.2 with HBase 0.98? Or should I set HBASE_VERSION back to "0.94.6"?
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297. The pull request is here: https://github.com/apache/spark/pull/1893
RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)
Update: I use a shell script to execute spark-shell; inside my-script.sh: $SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1 Although it correctly finishes the println("hallo world"), the strange thing is that my-script.sh finishes before spark-shell even finishes executing the script. Best regards, Henry Hung From: MA33 YTHung1 Sent: Thursday, August 28, 2014 10:01 AM To: user@spark.apache.org Subject: how to correctly run scala script using spark-shell through stdin (spark v1.0.0) Hi All, Right now I'm trying to execute a script using this command: nohup $SPARK_HOME/bin/spark-shell < $HOME/my-script.scala > $HOME/my-script.log 2>&1 my-script.scala just has 1 line of code: println("hallo world") But after waiting for a minute, I still don't receive the result from spark-shell. And the output log only contains: Spark assembly has been built with Hive, including Datanucleus jars on classpath SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data1/hadoop/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/data1/hadoop/hbase-0.96.0-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/08/28 09:38:27 INFO spark.SecurityManager: Changing view acls to: hadoop 14/08/28 09:38:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop) 14/08/28 09:38:27 INFO spark.HttpServer: Starting HTTP Server 14/08/28 09:38:27 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/08/28 09:38:27 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:49382 My question is: What is the right way to execute a spark script? I really want to use spark-shell as a way to run cron jobs in the future, just like when I run an R script. Best regards, Henry
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted, I tried the following steps to apply patch 1893 but got Hunk FAILED; can you please advise how to get through this error? Or is my spark-1.0.2 source not the correct one? Regards Arthur

    wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
    tar -vxf spark-1.0.2.tgz
    cd spark-1.0.2
    wget https://github.com/apache/spark/pull/1893.patch
    patch < 1893.patch
    patching file pom.xml
    Hunk #1 FAILED at 45.
    Hunk #2 FAILED at 110.
    2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
    patching file pom.xml
    Hunk #1 FAILED at 54.
    Hunk #2 FAILED at 72.
    Hunk #3 FAILED at 171.
    3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
    can't find file to patch at input line 267
    Perhaps you should have used the -p or --strip option?
    The text leading up to this was:
    --
    |
    |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
    |From: tedyu yuzhih...@gmail.com
    |Date: Mon, 11 Aug 2014 15:57:46 -0700
    |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
    | description to building-with-maven.md
    |
    |---
    | docs/building-with-maven.md | 3 +++
    | 1 file changed, 3 insertions(+)
    |
    |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
    |index 672d0ef..f8bcd2b 100644
    |--- a/docs/building-with-maven.md
    |+++ b/docs/building-with-maven.md
    --
    File to patch:

On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote: You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Can you use this command?

patch -p1 -i 1893.patch

Cheers
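A side note for anyone following along: GitHub pull-request .patch files are generated by git, so file paths inside them carry a/ and b/ prefixes and need -p1 to strip, which is why the bare patch < 1893.patch invocation above ends at the "File to patch:" prompt. The earlier "Hunk FAILED" messages are most likely a separate problem: the PR was generated against a newer tree than spark-1.0.2, so some hunks no longer line up (Ted addresses this later in the thread). GNU patch also supports a dry run to test a patch without modifying anything, e.g. patch --dry-run -p1 -i 1893.patch from inside the spark-1.0.2 directory.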
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted,

Thanks. I tried patch -p1 -i 1893.patch, but still got "Hunk #1 FAILED at 45." Is this normal?

Regards
Arthur

patch -p1 -i 1893.patch
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file examples/pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 succeeded at 122 (offset -49 lines).
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 succeeded at 122 (offset -40 lines).
Hunk #2 succeeded at 195 (offset -40 lines).
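Worth noting for later readers: when only some hunks fail like this, patch saves the unapplied fragments next to the target file, so the remaining changes can be applied by hand, e.g. by comparing cat examples/pom.xml.rej against the current examples/pom.xml.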
RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)
You can use spark-shell -i file.scala to run that. However, that keeps the interpreter open at the end, so you need to make your file end with System.exit(0) (or, even more robustly, do your work in a try {} and put that call in a finally {}).

In general it would be better to compile apps and run them with spark-submit. The Scala shell isn't that easy for debugging, etc.

Matei

On August 27, 2014 at 7:23:10 PM, Henry Hung (ythu...@winbond.com) wrote:

Update:

I use a shell script to execute spark-shell; inside my-script.sh:

$SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1

Although it correctly finishes the println("hallo world"), the strange thing is that my-script.sh finished before spark-shell even finished executing the script.

Best regards,
Henry Hung

From: MA33 YTHung1
Sent: Thursday, August 28, 2014 10:01 AM
To: user@spark.apache.org
Subject: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

Hi All,

Right now I'm trying to execute a script using this command:

nohup $SPARK_HOME/bin/spark-shell < $HOME/my-script.scala > $HOME/my-script.log 2>&1

my-script.scala has just one line of code:

println("hallo world")

But after waiting for a minute, I still don't receive the result from spark-shell, and the output log only contains:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data1/hadoop/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data1/hadoop/hbase-0.96.0-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/28 09:38:27 INFO spark.SecurityManager: Changing view acls to: hadoop
14/08/28 09:38:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/08/28 09:38:27 INFO spark.HttpServer: Starting HTTP Server
14/08/28 09:38:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/28 09:38:27 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:49382

My question is: what is the right way to execute a spark script? I really want to use spark-shell to run cron jobs in the future, just like when I run an R script.

Best regards,
Henry
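A minimal sketch of the pattern Matei describes above, assuming a hypothetical script file run with $SPARK_HOME/bin/spark-shell -i test.scala (the input path and the count logic are illustrative, not from the original thread):

// test.scala - run with: $SPARK_HOME/bin/spark-shell -i test.scala
// `sc` is the SparkContext the shell session itself provides.
try {
  val lines = sc.textFile("/tmp/input.txt")   // hypothetical input file
  println("line count: " + lines.count())
} finally {
  // Without this, spark-shell drops back into the interactive REPL
  // instead of exiting when the script finishes.
  System.exit(0)
}

The finally block guarantees the shell exits even when the body throws, which matters for cron-style runs like Henry's.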
Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)
Hi,

I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. I just tried to compile Spark 1.0.2 but got an error on Spark Project Hive. Can you please advise which repository has org.spark-project.hive:hive-metastore:jar:0.13.1?

FYI, below is my repository setting in Maven, which may be out of date:

<repository>
  <id>nexus</id>
  <name>local private nexus</name>
  <url>http://maven.oschina.net/content/groups/public/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

[INFO] Reactor Summary:
[INFO] Spark Project Parent POM ........ SUCCESS [1.892s]
[INFO] Spark Project Core .............. SUCCESS [1:33.698s]
[INFO] Spark Project Bagel ............. SUCCESS [12.270s]
[INFO] Spark Project GraphX ............ SUCCESS [2:16.343s]
[INFO] Spark Project ML Library ........ SUCCESS [4:18.495s]
[INFO] Spark Project Streaming ......... SUCCESS [39.765s]
[INFO] Spark Project Tools ............. SUCCESS [9.173s]
[INFO] Spark Project Catalyst .......... SUCCESS [35.462s]
[INFO] Spark Project SQL ............... SUCCESS [1:16.118s]
[INFO] Spark Project Hive .............. FAILURE [1:36.816s]
[INFO] Spark Project REPL .............. SKIPPED
[INFO] Spark Project YARN Parent POM ... SKIPPED
[INFO] Spark Project YARN Stable API ... SKIPPED
[INFO] Spark Project Assembly .......... SKIPPED
[INFO] Spark Project External Twitter .. SKIPPED
[INFO] Spark Project External Kafka .... SKIPPED
[INFO] Spark Project External Flume .... SKIPPED
[INFO] Spark Project External ZeroMQ ... SKIPPED
[INFO] Spark Project External MQTT ..... SKIPPED
[INFO] Spark Project Examples .......... SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 12:40.616s
[INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014
[INFO] Final Memory: 37M/826M
[ERROR] Failed to execute goal on project spark-hive_2.10: Could not resolve dependencies for project org.apache.spark:spark-hive_2.10:jar:1.0.2: The following artifacts could not be resolved: org.spark-project.hive:hive-metastore:jar:0.13.1, org.spark-project.hive:hive-exec:jar:0.13.1, org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc (http://maven.oschina.net/content/groups/public/) -> [Help 1]

Regards
Arthur
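A note for later readers, as an assumption rather than a confirmed fix: the org.spark-project.hive artifacts that Spark 1.0.2 builds against were published for Hive 0.12.0, so no repository is likely to carry hive-metastore:jar:0.13.1 under that groupId, and reverting the property to <hive.version>0.12.0</hive.version> should let the Hive module resolve. The thread Ted links in the reply below covers getting Hive 0.13 working.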
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297. You can download it and replace examples/pom.xml with the downloaded pom.

I am running this command locally:

mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

Cheers
Re: Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)
See this thread:
http://search-hadoop.com/m/JW1q5wwgyL1/Working+Formula+for+Hive+0.13subj=Re+Working+Formula+for+Hive+0+13+
Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra
Hi Yana,

I have done a take and confirmed the existence of data. I also checked that it is getting connected to Cassandra. That is why I suspect that this particular RDD is not serializable.

Thanks,
Lmk

On Aug 28, 2014 5:13 AM, Yana [via Apache Spark User List] ml-node+s1001560n12960...@n3.nabble.com wrote:

I'm not so sure that your error is coming from the Cassandra write. You have val data = test.map(..).map(..), so data will actually not get created until you try to save it. Can you try to do something like data.count() or data.take(k) after this line and see if you even get to the Cassandra part? My suspicion is that you're trying to access something (SparkConf?) within the map closures...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Cassandra-NotSerializable-Exception-while-saving-to-cassandra-tp12906p12984.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
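For readers hitting the same exception, a minimal sketch of the pitfall Yana suspects; the enclosing class, field names, and numbers are hypothetical, not taken from the original code:

import org.apache.spark.{SparkConf, SparkContext}

class SaveJob {
  val conf = new SparkConf().setAppName("cassandra-save")
  val sc = new SparkContext(conf)

  def bad(): Long = {
    val data = sc.parallelize(1 to 100)
    // Touching the `conf` field inside the closure captures the whole
    // SaveJob instance, including the non-serializable SparkContext, so
    // the exception only fires when an action (save/count) finally runs.
    data.map(i => i + conf.get("spark.app.name").length).count()
  }

  def good(): Long = {
    val data = sc.parallelize(1 to 100)
    // Copy what the closure needs into a local val; only the String is shipped.
    val appName = conf.get("spark.app.name")
    data.map(i => i + appName.length).count()
  }
}

This also matches Yana's point about laziness: the map calls themselves succeed, and the serialization failure surfaces at the action that triggers execution.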
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed.

You can verify the dependence on HBase 0.98 through this command:

mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt

Cheers
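A quick way to confirm the resolved HBase version in the dep.txt Ted's command produces would be something like grep 'org.apache.hbase' dep.txt, which should list the hbase artifacts at 0.98.x in the printed dependency tree.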
Update on Pig on Spark initiative
Hi,

We have migrated Pig functionality on top of Spark, passing 100% of e2e success cases in the Pig test suite. That means UDFs, joins, and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks).

Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork. This contains all the major changes but may not have all the patches required for 100% e2e; if you are trying it out, let me know any issues you face.

A whole bunch of folks contributed to this: Julien Le Dem (Twitter), Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga (Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics), Aniket Mokashi (Google), Greg Owen (Databricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh Kalakoti (Sigmoid Analytics). Not to mention the Spark and Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Re: Update on Pig on Spark initiative
Awesome to hear this, Mayur! Thanks for putting this together.

Matei
Submitting multiple files pyspark
Hi,

I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in Spark?

Thanks
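For anyone searching the archive later, one sketch of an answer (not from this thread): spark-submit can ship additional Python modules to the executors with its --py-files option, e.g. spark-submit --py-files helper.py main_app.py, which places helper.py on the Python path of both the driver and the workers so that import helper works inside main_app.py.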