Re: Spark per app logging
Hi, I'm not completely sure about this either, but this is what we are doing currently: configure your logging to write to STDOUT, not to a file explicitly. Spark will capture stdout and stderr and separate the messages into an app/driver folder structure in the configured worker directory. We then use Logstash to collect the logs and index them into an Elasticsearch cluster (Spark seems to produce a lot of logging data). With some simple regex processing, you also get the application id as a searchable field. Regards, Jeff

2015-03-20 22:37 GMT+01:00 Ted Yu yuzhih...@gmail.com: Are these jobs the same jobs, just run by different users, or different jobs? If the latter, can each application use its own log4j.properties? Cheers

On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote: Hi, We have Spark set up such that there are various users running multiple jobs at the same time. Currently all the logs go to one file specified in log4j.properties. Is it possible to configure log4j in Spark for per-app/user logging instead of sending all logs to the one file mentioned in log4j.properties? Thanks Udit
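For reference, a minimal log4j.properties sketch that routes everything to stdout, as Jeff describes (appender name and pattern are illustrative; Spark's own log4j-defaults.properties uses a similar console appender):

    # Route all logging to the console so the Spark worker can capture it
    # into its per-app/driver directories.
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.out
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n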
Re: About the env of Spark1.2
If you are using 127.0.0.1, check /etc/hosts and make sure your hostname resolves; for example, add (or fix) a 127.0.1.1 entry naming the host, or map it to localhost.

On Sat, Mar 21, 2015 at 9:57 AM, Ted Yu yuzhih...@gmail.com wrote: bq. Caused by: java.net.UnknownHostException: dhcp-10-35-14-100: Name or service not known
Can you check your DNS? Cheers

On Fri, Mar 20, 2015 at 8:54 PM, tangzilu zilu.t...@hotmail.com wrote: Hi All: I recently started to deploy Spark 1.2 in my VirtualBox Linux. But when I run the command ./spark-shell in /opt/spark-1.2.1/bin, I got a result like this:

[root@dhcp-10-35-14-100 bin]# ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/20 13:56:06 INFO SecurityManager: Changing view acls to: root
15/03/20 13:56:06 INFO SecurityManager: Changing modify acls to: root
15/03/20 13:56:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/20 13:56:06 INFO HttpServer: Starting HTTP Server
15/03/20 13:56:06 INFO Utils: Successfully started service 'HTTP class server' on port 47691.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.

java.net.UnknownHostException: dhcp-10-35-14-100: dhcp-10-35-14-100: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:710)
at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:702)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:702)
at org.apache.spark.HttpServer.uri(HttpServer.scala:158)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:982)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
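For illustration, a minimal /etc/hosts sketch for this machine (assuming the hostname from the trace, dhcp-10-35-14-100):

    127.0.0.1   localhost
    127.0.1.1   dhcp-10-35-14-100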
Re: How to set Spark executor memory?
Hi Sean, It's getting strange now. If I run from the IDE, my executor memory is always set to 6.7G, no matter what value I set in code. I have checked my environment variables, and there's no value of 6.7 or 12.5. Any idea? Thanks, David

On Tue, 17 Mar 2015 00:35 null jishnu.prat...@wipro.com wrote: Hi Xi Shen, You could set spark.executor.memory in the code itself: new SparkConf().set("spark.executor.memory", "2g"). Or you can try --conf spark.executor.memory=2g while submitting the jar. Regards Jishnu Prathap

*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Monday, March 16, 2015 2:06 PM *To:* Xi Shen *Cc:* user@spark.apache.org *Subject:* Re: How to set Spark executor memory? By default spark.executor.memory is set to 512m. I'm assuming, since you are submitting the job using spark-submit, that it is not able to override the value because you are running in local mode. Can you try it without using spark-submit, as a standalone project? Thanks Best Regards

On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote: I set it in code, not by configuration. I submit my jar file locally. I am working in my developer environment.

On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote: How are you setting it? And how are you submitting the job? Thanks Best Regards

On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page, I can see this value has been set correctly. But in the Executors page, I saw there's only 1 executor and its memory is 265.4MB. Very strange value. Why not 256MB, or just what I set? What am I missing here? Thanks, David
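A minimal sketch of setting the value programmatically (the app name is illustrative); note that in local mode the "executor" runs inside the driver JVM, so driver memory is what actually matters:

    import org.apache.spark.{SparkConf, SparkContext}

    // Must be set before the SparkContext is created; changing the conf
    // afterwards has no effect on executor sizing.
    val conf = new SparkConf()
      .setAppName("MemoryDemo")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)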
Re: Can I start multiple executors in local mode?
No, I didn't mean local-cluster. I meant running locally, as in an IDE.

On Mon, 16 Mar 2015 23:12 xu Peng hsxup...@gmail.com wrote: Hi David, You can try local-cluster. The numbers in local-cluster[2,2,1024] mean 2 workers, 2 cores per worker, and 1024 MB of memory per worker. Best Regards Peng Xu

2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com: Hi, In YARN mode you can specify the number of executors. I wonder if we can also start multiple executors locally, just to make test runs faster. Thanks, David
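For completeness, sketches of both masters mentioned in the thread (local[N] gives a single in-process executor with N task threads; local-cluster is mainly used by Spark's own tests and needs a built Spark distribution available):

    import org.apache.spark.SparkConf

    // Local mode: one executor in-process, 4 task threads.
    val local = new SparkConf().setAppName("Test").setMaster("local[4]")

    // Simulated standalone cluster: 2 workers, 2 cores each, 1024 MB each.
    val cluster = new SparkConf().setAppName("Test").setMaster("local-cluster[2,2,1024]")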
Re: Spark Streaming S3 Performance Implications
Mike: Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced performance of s3a. See HADOOP-11571. Cheers
Re: Spark Streaming S3 Performance Implications
hey mike! you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard and unioning all the shards per the KinesisWordCount example that is included with the Kinesis streaming package. you'll need more cores (cluster) or threads (local) to support this - equalling at least the number of shards/receivers + 1. also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches - or expand even further and write window batches (where the window interval is a multiple of the batch interval). this goes for any spark streaming implementation - not just Kinesis. lemme know if that works for you. thanks! -Chris

_ From: Mike Trienis mike.trie...@orcsol.com Sent: Wednesday, March 18, 2015 2:45 PM Subject: Spark Streaming S3 Performance Implications To: user@spark.apache.org

Hi All, I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that rdd.saveAsTextFile("s3n://...") is taking a few seconds to complete.

val saveFunc = (rdd: RDD[String], time: Time) => {
  val count = rdd.count()
  if (count > 0) {
    val s3BucketInterval = time.milliseconds.toString
    rdd.saveAsTextFile("s3n://...")
  }
}
dataStream.foreachRDD(saveFunc)

Should I expect the same behaviour in a deployed cluster? Or does rdd.saveAsTextFile("s3n://...") distribute the push work to each worker node? "Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file." Thanks, Mike.
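A sketch of Chris's windowing suggestion (bucket path and durations are illustrative), assuming dataStream is a DStream[String]:

    import org.apache.spark.streaming.Seconds

    // Write once per 60-second window instead of once per 1-second batch,
    // so each S3 write covers many batch intervals.
    dataStream.window(Seconds(60), Seconds(60)).foreachRDD { (rdd, time) =>
      if (rdd.count() > 0) {
        rdd.saveAsTextFile(s"s3n://my-bucket/${time.milliseconds}")
      }
    }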
'nested' RDD problem, advice needed
Hi, I wonder if someone can help suggest a solution to my problem. I had a simple process working using Strings and now want to convert to RDD[Char]; the problem is that I end up with a nested call, as follows:

1) Load a text file into an RDD[Char]:

val inputRDD = sc.textFile("myFile.txt").flatMap(_.toIterator)

2) I have a method that takes two parameters:

object Foo {
  def myFunction(inputRDD: RDD[Char], shift: Int): RDD[Char] = ...

3) I have a method, 'bar', that the driver process calls once it has loaded the inputRDD, as follows:

def bar(inputRDD: RDD[Char]): Int = {
  val solutionSet = sc.parallelize((1 to alphabetLength).toList)
    .map(shift => (shift, Foo.myFunction(inputRDD, shift)))

What I'm trying to do is take the list 1..26 and generate a set of tuples { (1, RDD(1)), ..., (26, RDD(26)) }, which is the inputRDD passed through the function above, but with a different shift parameter each time. In my original version I could parallelise the algorithm fine, but my input had to be in a 'String' variable; I'd rather it be an RDD (the string could be large). I think the way I'm trying to do it above won't work because it's a nested RDD call. Can anybody suggest a solution? Regards, Mike Lewis
Re: Spark Streaming Not Reading Messages From Multiple Kafka Topics
Hey Eason! Weird problem indeed. More information will probably help to find the issue: Have you searched the logs for peculiar messages? What does your Spark environment look like? #workers, #threads, etc.? Does it work if you create separate receivers for the topics? Regards, Jeff

2015-03-21 2:27 GMT+01:00 EH eas...@gmail.com: Hi all, I'm building a Spark Streaming application that will continuously read multiple Kafka topics at the same time. However, I found a weird issue: it reads only hundreds of messages and then stops reading any more. If I change the three topics to only one topic, then it is fine and it continues to consume. Below is the code I have.

val consumerThreadsPerInputDstream = 1
val topics = Map("raw_0" -> consumerThreadsPerInputDstream,
                 "raw_1" -> consumerThreadsPerInputDstream,
                 "raw_2" -> consumerThreadsPerInputDstream)
val msgs = KafkaUtils.createStream(ssc, "10.10.10.10:2181/hkafka", "group01", topics).map(_._2)
...

How come it no longer consumes after hundreds of messages when reading three topics? How can I resolve this issue? Thank you for your help, Eason
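A sketch of Jeff's separate-receivers suggestion (same hypothetical ZooKeeper quorum and group id as in the code above), unioning one receiver per topic:

    val streams = Seq("raw_0", "raw_1", "raw_2").map { topic =>
      // One receiver per topic, each with a single consumer thread.
      KafkaUtils.createStream(ssc, "10.10.10.10:2181/hkafka", "group01",
        Map(topic -> 1)).map(_._2)
    }
    val msgs = ssc.union(streams)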
Re: How to set Spark executor memory?
If you are running from your IDE, then I don't know what you are running or in what mode. The discussion here concerns using standard mechanisms like spark-submit to configure executor memory. Please try these first instead of trying to directly invoke Spark, which will require more understanding of how the props are set.

On Sat, Mar 21, 2015 at 5:30 AM, Xi Shen davidshe...@gmail.com wrote: Hi Sean, It's getting strange now. If I run from the IDE, my executor memory is always set to 6.7G, no matter what value I set in code. I have checked my environment variables, and there's no value of 6.7 or 12.5. Any idea? Thanks, David
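A sketch of the standard mechanisms Sean refers to (class and jar names are illustrative), including the local-mode caveat from earlier in the thread:

    # In local mode the "executor" lives inside the driver JVM, so size the driver:
    spark-submit --class com.example.MyApp --master "local[4]" \
      --driver-memory 2g target/myapp.jar

    # Against a real cluster, size the executors directly:
    spark-submit --class com.example.MyApp --master spark://master:7077 \
      --executor-memory 2g target/myapp.jar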
Model deployment help
Hi, Apologies for the generic question. I am developing predictive models for the first time, and a model will be deployed in production very soon. Could somebody help me with model deployment in production? I have read quite a bit on model deployment and some books on database deployment. My queries relate to how updates to the model happen when the current model degenerates, without any downtime; how others are deploying in production; and a few lines on the adoption of PMML in production today. Please provide me with some good links or forums so that I can learn, as most books do not cover it extensively except 'Mahout in Action', where it is explained in some detail; I have also checked Stack Overflow but have not got any relevant answers.

What I understand:
1. Build the model using the current training set and test the model.
2. Deploy the model: put it in some location, load it, and predict when a request comes in for scoring.
3. The model degenerates; now build a new model with new data. (Here some confusion: whether the old data is discarded completely, whether it is done with purely new data, or a mix.)
4. Here I am stuck: how to update the model without any downtime during the transition period between the old model and the new model.

My naive solution would be: build the new model, save it in a new location, and update the new path in some properties file, or update the location in a database when the saving is done. Is this correct, or are some best practices available? A database is unlikely in my case. Thanks in advance.
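A minimal sketch of the pointer-swap idea in item 4 (the types and names are hypothetical): keep scoring against an atomic reference and swap it only once the new model has fully loaded, so there is no downtime.

    import java.util.concurrent.atomic.AtomicReference

    // Serves predictions from whichever model is current; swap() is atomic,
    // so in-flight requests finish on the old model and new ones see the new.
    class ModelServer[M, F, P](initial: M, predict: (M, F) => P) {
      private val current = new AtomicReference[M](initial)
      def score(features: F): P = predict(current.get, features)
      def swap(newModel: M): Unit = current.set(newModel)
    }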
Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged
bq. Requesting 1 new executor(s) because tasks are backlogged

1 executor was requested. Which Hadoop release are you using? Can you check the resource manager log to see if there is some clue? Thanks

On Fri, Mar 20, 2015 at 4:17 PM, Manoj Samel manojsamelt...@gmail.com wrote: Forgot to add - the cluster is idle otherwise, so there should be no resource issues. Also, the configuration works when not using dynamic allocation.

On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Running Spark 1.3 with secured Hadoop. Spark-shell in YARN client mode runs without issue when not using dynamic allocation. When dynamic allocation is turned on, the shell comes up, but the same SQL etc. causes it to loop.

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=1
spark.dynamicAllocation.maxExecutors=10
# Set IdleTime low for testing
spark.dynamicAllocation.executorIdleTimeout=60
spark.shuffle.service.enabled=true

Following is the start of the messages, and then it keeps looping with "Requesting 0 new executors":

15/03/20 22:52:42 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/20 22:52:42 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/03/20 22:52:42 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at mapPartitions at Exchange.scala:100)
15/03/20 22:52:42 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
15/03/20 22:52:47 INFO spark.ExecutorAllocationManager: Requesting 1 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:52:52 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:52:57 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/03/20 22:52:57 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:02 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:07 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:12 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:12 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/03/20 22:53:17 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:22 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
15/03/20 22:53:27 INFO spark.ExecutorAllocationManager: Requesting 0 new executor(s) because tasks are backlogged (new desired total will be 1)
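One thing worth checking (an assumption, not confirmed in the thread): on YARN, spark.shuffle.service.enabled=true also requires Spark's external shuffle service to be registered with each NodeManager, roughly:

    <!-- yarn-site.xml sketch: register Spark's external shuffle service -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>

The spark-<version>-yarn-shuffle.jar must also be on the NodeManager classpath; executors that cannot register with the shuffle service will fail to come up.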
Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues
Thank you for your help, Akhil! We found that it no longer works to remotely connect from our laptop to the remote Spark cluster, but it works if the client is on the remote cluster as well, starting from version 1.2.0 and beyond (v1.1.1 and below are fine). Not sure if this is related to Spark's internal communication being upgraded to a Netty-based implementation in v1.2.0 (https://issues.apache.org/jira/browse/SPARK-2468), which may not fit our firewall / network setup between the laptop and the remote servers. This is not very good for development and debugging, since for every little change we need to recompile the entire jar, upload it to the remote server, and then execute, instead of running it right away on the local machine - but at least it works now. Best, Eason

On Thu, Mar 19, 2015 at 11:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you submitting your application from local to a remote host? If you want to run the Spark application from a remote machine, then you have to at least set the following configurations properly.
- *spark.driver.host* - points to the ip/host from where you are submitting the job (make sure you are able to ping this from the cluster)
- *spark.driver.port* - set it to a port number which is accessible from the spark cluster.
You can look at more configuration options over here: http://spark.apache.org/docs/latest/configuration.html#networking Thanks Best Regards

On Fri, Mar 20, 2015 at 4:02 AM, Eason Hu eas...@gmail.com wrote: Hi Akhil, Thank you for your help. I just found that the problem is related to my local Spark application: I ran it in IntelliJ and didn't reload the project after recompiling the jar via Maven. Without a reload, it uses some locally cached data to run the application, which leads to two different versions. After I reloaded the project and reran, it ran fine on v1.1.1 and I no longer saw the class-incompatibility issues. However, I now encounter a new issue starting from v1.2.0 and above.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/19 01:10:17 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/03/19 01:10:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/19 01:10:17 INFO SecurityManager: Changing view acls to: hduser,eason.hu
15/03/19 01:10:17 INFO SecurityManager: Changing modify acls to: hduser,eason.hu
15/03/19 01:10:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, eason.hu); users with modify permissions: Set(hduser, eason.hu)
15/03/19 01:10:18 INFO Slf4jLogger: Slf4jLogger started
15/03/19 01:10:18 INFO Remoting: Starting remoting
15/03/19 01:10:18 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@hduser-07:59122]
15/03/19 01:10:18 INFO Utils: Successfully started service 'driverPropsFetcher' on port 59122.
15/03/19 01:10:21 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.53:65001] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://sparkDriver@192.168.1.53:65001]].
15/03/19 01:10:48 ERROR UserGroupInformation: PriviledgedActionException as:eason.hu (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at
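A sketch of Akhil's suggested settings for submitting from a laptop (the IP and port values are hypothetical and must match your firewall rules):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("RemoteDebug")
      .setMaster("spark://remote-master:7077")
      .set("spark.driver.host", "203.0.113.5") // the laptop's IP, reachable from the cluster
      .set("spark.driver.port", "51000")       // a port the cluster can connect back to
    val sc = new SparkContext(conf)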
Re: Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)
1. Make sure your secret key doesn't have a / in it. If it does, generate a new key.
2. jets3t and Hadoop JAR versions need to be in sync; jets3t 0.9.0 was picked up in Hadoop 2.4, and not before, AFAIK.
3. Hadoop 2.6 has a new S3 client, s3a, which is compatible with s3n data. It uses the AWS toolkit over JetS3t, where all future dev is going. Assuming it is up to date with the AWS toolkit, it will do the auth. Not knowingly tested against Frankfurt though; just Ireland, US East, US West and Japan. S3a still has some quirks being worked through; HADOOP-11571 lists the set fixed.

On 20 Mar 2015, at 15:15, Ralf Heyde r...@hubrick.com wrote: Good idea, will try that. But assuming only data is located there, the problem will still occur.

On Fri, Mar 20, 2015 at 3:08 PM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi Ralf, using secret keys and authorization details is a strict NO for AWS; they are major security lapses and should be avoided at any cost. Have you tried starting the clusters using ROLES? They are a wonderful way to start clusters or EC2 nodes, and you do not have to copy and paste any permissions either. Try going through this article in AWS: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-iam-roles.html (though for Data Pipeline, it shows the correct set of permissions to enable). I start EC2 nodes using roles (as mentioned in the link above) and run the aws cli commands (without copying any keys or files). Please let me know if the issue was resolved. Regards, Gourav

On Fri, Mar 20, 2015 at 1:53 PM, Ralf Heyde r...@hubrick.com wrote: Hey, we want to run a job, accessing S3, from EC2 instances. The job runs in a self-provided Spark cluster (1.3.0) on EC2 instances. In Ireland everything works as expected. I just tried to move data from Ireland to Frankfurt. AWS S3 is forcing v4 of their API there, meaning access is only possible via AWS4-HMAC-SHA256. This is still ok, but I don't get access there. What I tried already, each approach with these URLs:
A) s3n://key:secret@bucket/path/
B) s3://key:secret@bucket/path/
C) s3n://bucket/path/
D) s3://bucket/path/

1a. Setting environment variables in the operating system.
1b. Found something to set the access key/secret in SparkConf like this (I guess this does not have any effect):
sc.set("AWS_ACCESS_KEY_ID", id)
sc.set("AWS_SECRET_ACCESS_KEY", secret)
2. Tried to use a more up-to-date jets3t client (somehow I was not able to get the new version running).
3. Tried in-URL basic authentication (A + B).
4. Setting the Hadoop configuration:
hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
hadoopConfiguration.set("fs.s3n.awsAccessKeyId", key);
hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secret);
hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey);
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", myAccessSecret);
--
Caused by: org.jets3t.service.S3ServiceException: S3 GET failed for '/%2FEAN%2F2015-03-09-72640385%2Finput%2FHotelImageList.gz' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>43F8F02E767DC4A2</RequestId><HostId>wgMeAEYcZZa/2BazQ9TA+PAkUxt5l+ExnT4Emb+1Uk5KhWfJu5C8Xcesm1AXCfJ9nZJMyh4wPX8=</HostId></Error>

2. Setting the Hadoop configuration to the native filesystem:
hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConfiguration.set("fs.s3n.awsAccessKeyId", key);
hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secret);
hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey);
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", myAccessSecret);
--
Caused by: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/EAN%2F2015-03-09-72640385%2Finput%2FHotelImageList.gz' - ResponseCode=400, ResponseMessage=Bad Request

5. Without Hadoop config:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

6. Without Hadoop config, but passed in the S3 URL:
with A) Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/EAN%2F2015-03-09-72640385%2Finput%2FHotelImageList.gz' - ResponseCode=400, ResponseMessage=Bad Request
with B) Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
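For what it's worth, a sketch of the s3a route from point 3 (Hadoop 2.6+; the property names are real, but this is untested against Frankfurt, and the endpoint value is an assumption):

    // Point s3a at the eu-central-1 endpoint, which only speaks SigV4.
    val hc = sc.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
    hc.set("fs.s3a.access.key", accessKey)
    hc.set("fs.s3a.secret.key", secretKey)
    val rdd = sc.textFile("s3a://bucket/path/")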
Re: saveAsTable broken in v1.3 DataFrames?
I believe that you can get what you want by using HiveQL instead of the pure programmatic API. This is a little verbose, so perhaps a specialized function would also be useful here. I'm not sure I would call it saveAsExternalTable, as there are also external Spark SQL data source tables that have nothing to do with Hive. The following should create a proper Hive table:

df.registerTempTable("df")
sqlContext.sql("CREATE TABLE newTable AS SELECT * FROM df")

At the very least we should clarify in the documentation to avoid future confusion. The piggybacking is a little unfortunate, but it also gives us a lot of new functionality that we can't get when strictly following the way that Hive expects tables to be formatted. I'd suggest opening a JIRA for the specialized method you describe. Feel free to mention me and Yin in a comment when you create it.

On Fri, Mar 20, 2015 at 12:55 PM, Christian Perez christ...@svds.com wrote: Any other users interested in a feature DataFrame.saveAsExternalTable() for making _useful_ external tables in Hive, or am I the only one? Bueller? If I start a PR for this, will it be taken seriously?

On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez christ...@svds.com wrote: Hi Yin, Thanks for the clarification. My first reaction is that if this is the intended behavior, it is a wasted opportunity. Why create a managed table in Hive that cannot be read from inside Hive? I think I understand now that you are essentially piggybacking on Hive's metastore to persist table info between/across sessions, but I imagine others might expect more (as I have). We find ourselves wanting to do work in Spark and persist the results where other users (e.g. analysts using Tableau connected to Hive/Impala) can explore it. I imagine this is very common. I can, of course, save it as Parquet and create an external table in Hive (which I will do now), but saveAsTable seems much less useful to me now. Any other opinions? Cheers, C

On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai yh...@databricks.com wrote: I meant table properties and serde properties are used to store metadata of a Spark SQL data source table. We do not set other fields like SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table should not show unrelated stuff like SerDe lib and InputFormat. I have created https://issues.apache.org/jira/browse/SPARK-6413 to track the improvement of the output of the DESCRIBE statement.

On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote: Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created is not a Hive table, but a Spark SQL data source table (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to be specific, only table properties and serde properties). When you look at the table properties, there will be a field called spark.sql.sources.provider, and the value will be org.apache.spark.sql.parquet.DefaultSource. You can also look at your files in the file system; they are stored by Parquet. Thanks, Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com wrote: Hi all, DataFrame.saveAsTable creates a managed table in Hive (v0.13 on CDH 5.3.2) in both spark-shell and pyspark, but creates the *wrong* schema _and_ storage format in the Hive metastore, so that the table cannot be read from inside Hive. Spark itself can read the table, but Hive throws a serialization error because it doesn't know it is Parquet.
val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
df.saveAsTable("spark_test_foo")

Expected:

COLUMNS(
  education BIGINT,
  income BIGINT
)
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

Actual:

COLUMNS(
  col array<string> COMMENT 'from deserializer'
)
SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

---

Manually changing the schema and storage restores access in Hive and doesn't affect Spark. Note also that Hive's table property spark.sql.sources.schema is correct. At first glance, it looks like the schema data is serialized when sent to Hive but not deserialized properly on receive. I'm tracing execution through source code... but before I get any deeper, can anyone reproduce this behavior? Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd
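A sketch of the Parquet-plus-external-table workaround Christian mentions (the path is hypothetical; saveAsParquetFile is the 1.3-era API):

    df.saveAsParquetFile("hdfs:///user/hive/warehouse/spark_test_foo")
    sqlContext.sql("""
      CREATE EXTERNAL TABLE spark_test_foo (education BIGINT, income BIGINT)
      STORED AS PARQUET
      LOCATION '/user/hive/warehouse/spark_test_foo'
    """)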
Spark streaming alerting
Is there a module in Spark Streaming that lets you listen to alerts/conditions as they happen in the streaming job? Generally, Spark Streaming components execute on a large cluster and write to stores like HDFS or Cassandra; however, when it comes to alerting, you generally can't send alerts directly from the Spark workers, which means you need a way to listen for them.
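One common pattern (a sketch, not an official module; events, isAlert and sendAlert are hypothetical): evaluate the conditions on the executors, then collect only the matching records to the driver and forward them from there.

    events.foreachRDD { rdd =>
      // filter runs on the executors; collect() brings only the alerts back,
      // so the notification call below runs on the driver.
      rdd.filter(isAlert).collect().foreach(sendAlert)
    }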
join two DataFrames, same column name
I have a couple of data frames that I pulled from Spark SQL, and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an exception because it finds the two columns of the same name to be ambiguous. Is there a way to specify which side of the join comes from data frame A and which comes from B?

var df1 = sqlContext.sql("select * from table1")
var df2 = sqlContext.sql("select * from table2")
df1.join(df2, df1("column_id") === df2("column_id"))
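One way to disambiguate after the join (a sketch; other_field is a hypothetical column) is to alias each DataFrame and refer to columns through the aliases:

    import org.apache.spark.sql.functions.col

    val a = df1.as("a")
    val b = df2.as("b")
    val joined = a.join(b, col("a.column_id") === col("b.column_id"))
      .select(col("a.column_id"), col("b.other_field"))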
Re: Did DataFrames break basic SQLContext?
Now, I am not able to directly use my RDD object and have it implicitly become a DataFrame. It can be used via a DataFrameHolder, with which I could write: rdd.toDF.registerTempTable("foo"). The rationale here was that we added a lot of methods to DataFrame and made the implicits more powerful, but that increased the likelihood of accidental application of the implicit. I personally have had to explain the accidental application of implicits (and the confusing compiler messages that can result) to beginners so many times that we decided to remove the subtle conversion from RDD to DataFrame and instead make it an explicit method call.
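A sketch of the 1.3 idiom (case class and names illustrative); the implicits import is what brings toDF into scope:

    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    rdd.toDF().registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").show()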
Re: How to set Spark executor memory?
In the log, I saw: MemoryStore: MemoryStore started with capacity 6.7 GB. But I still cannot find where to set this storage capacity.

On Sat, 21 Mar 2015 20:30 Xi Shen davidshe...@gmail.com wrote: Hi Sean, It's getting strange now. If I run from the IDE, my executor memory is always set to 6.7G, no matter what value I set in code. I have checked my environment variables, and there's no value of 6.7 or 12.5. Any idea? Thanks, David
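For reference (an explanation not given in the thread): in Spark 1.x the MemoryStore capacity is derived from the JVM heap, roughly capacity ≈ maxHeap × spark.storage.memoryFraction (default 0.6) × safety fraction (0.9). So 6.7 GB ≈ 12.4 GB × 0.54, which suggests the IDE's own JVM heap setting (default or -Xmx) is what is being picked up, and would also explain the 12.5 figure David mentions.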
netlib-java cannot load native lib in Windows when using spark-submit
Hi, I use the *OpenBLAS* DLL, and have configured my application to work in the IDE. When I start my Spark application from the IntelliJ IDE, I can see in the log that the native lib is loaded successfully. But if I use *spark-submit* to start my application, the native lib still cannot be loaded: I saw the WARN message that it failed to load both the native and native-ref libraries. I checked the *Environment* tab in the Spark UI, and *java.library.path* is set correctly. Thanks, David
Re: How to set Spark executor memory?
Yeah, I think it is harder to troubleshoot property issues in an IDE. But the reason I stick to the IDE is that if I use spark-submit, the BLAS native library cannot be loaded. Maybe I should open another thread to discuss that. Thanks, David

On Sun, 22 Mar 2015 10:38 Xi Shen davidshe...@gmail.com wrote: In the log, I saw: MemoryStore: MemoryStore started with capacity 6.7 GB. But I still cannot find where to set this storage capacity.
Reducing Spark's logging verbosity
Hi, Does anyone have concrete recommendations on how to reduce Spark's logging verbosity? We have attempted on several occasions to address this by setting various log4j properties, both in configuration property files and in $SPARK_HOME/conf/spark-env.sh; however, all of those attempts have failed. Any suggestions are welcome. Thank you, Edmon
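A sketch of the usual approach (assuming a standard installation): copy conf/log4j.properties.template to conf/log4j.properties and raise the levels, e.g.:

    # Quiet Spark down to warnings only.
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.logger.org.apache.spark=WARN

Note that spark-env.sh is not read for log4j settings; the log4j.properties file must be on the driver/executor classpath (the conf directory is, by default).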
Re: netlib-java cannot load native lib in Windows when using spark-submit
Can you try the --driver-library-path option?

spark-submit --driver-library-path /opt/hadoop/lib/native ...

Cheers

On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I use the *OpenBLAS* DLL, and have configured my application to work in the IDE. When I start my Spark application from the IntelliJ IDE, I can see in the log that the native lib is loaded successfully. But if I use *spark-submit* to start my application, the native lib still cannot be loaded: I saw the WARN message that it failed to load both the native and native-ref libraries. I checked the *Environment* tab in the Spark UI, and *java.library.path* is set correctly. Thanks, David
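On Windows the same idea would look roughly like this (paths and names are hypothetical; point it at the directory containing the OpenBLAS DLLs):

    spark-submit --driver-library-path C:\opt\openblas\bin ^
      --class com.example.MyApp target\myapp.jar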
Error while installing Spark 1.3.0 on local machine
Hello, I am trying to install Spark 1.3.0 on my Mac. Earlier, I was working with Spark 1.1.0. Now, I come across this error:

sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:123)
at sbt.IvySbt.withIvy(Ivy.scala:120)
at sbt.IvySbt$Module.withModule(Ivy.scala:151)
at sbt.IvyActions$.updateEither(IvyActions.scala:157)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)

[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[error] Total time: 5 s, completed Mar 21, 2015 7:48:45 PM

I tried uninstalling and reinstalling. When I browsed the internet, I came across suggestions to include a -Phadoop profile, but even if I use

build/sbt -Pyarn -Phadoop-2.3 assembly

it gives me an error. I greatly appreciate any help. Thank you for your time. -- Regards, Haripriya Ayyalasomayajula Graduate Student Department of Computer Science University of Houston Contact : 650-796-7112
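For comparison, a build sketch via Maven (following the 1.3 build docs; the profile and Hadoop version are illustrative), which sidesteps the sbt resolution path entirely:

    build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package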
Re: How to set Spark executor memory?
bq. the BLAS native cannot be loaded

Have you tried specifying the --driver-library-path option? Cheers

On Sat, Mar 21, 2015 at 4:42 PM, Xi Shen davidshe...@gmail.com wrote: Yeah, I think it is harder to troubleshoot property issues in an IDE. But the reason I stick to the IDE is that if I use spark-submit, the BLAS native library cannot be loaded. Maybe I should open another thread to discuss that. Thanks, David
Re: Model deployment help
Hi Shashidhar, Our team at PredictionIO is trying to solve the production deployment of models. We built a powered-by-Spark framework (also certified on Spark by Databricks) that allows a user to build models with everything available from the Spark API, persist the model automatically with versioning, and deploy it as a REST service using simple CLI commands. Regarding model degeneration and updates, if having half a second to a couple of seconds of downtime is acceptable, with PIO one could simply run pio train and pio deploy periodically with a cronjob. To achieve virtually zero downtime, a load balancer could be set up in front of 2 pio deploy instances. Porting your current algorithm / model generation to PredictionIO should just be a copy-and-paste procedure. We would be very grateful for any feedback that would improve the deployment process. We do not support PMML at the moment, but are definitely interested in your use case. You may get started with the documentation (http://docs.prediction.io/). You could also visit the engine template gallery (https://templates.prediction.io/) for quick, ready-to-use examples. PredictionIO is open source software under APL2 at https://github.com/PredictionIO/PredictionIO. Looking forward to hearing your feedback! Best Regards, Donald -- Donald Szeto PredictionIO
How to do nested foreach with RDD
Hi, I have two big RDDs, and I need to do some math against each pair of elements from them. Traditionally, this is like a nested for-loop. But for RDDs, it results in a nested RDD, which is prohibited. Currently, I am collecting one of them and then doing a nested for-loop to avoid the nested RDD. But I would like to know if there's a Spark way to do this. Thanks, David
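One Spark-native alternative (a sketch; compute is a hypothetical pair function): RDD.cartesian pairs every element of one RDD with every element of the other, so no nesting or driver-side collect is needed.

    // Produces RDD[(A, B)] with every pairing; beware the |A| * |B| output size.
    val pairs = rddA.cartesian(rddB)
    val results = pairs.map { case (a, b) => compute(a, b) }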