Spark SQL 1.2 with CDH 4, Hive UDF is not working.
Hi, I'm currently migrating from Shark 0.9 to Spark SQL 1.2; my CDH version is 4.5 with Hive 0.11. I've managed to set up the Spark SQL Thrift server and normal queries work fine, but custom UDFs are not usable. The symptom is that when executing CREATE TEMPORARY FUNCTION, the query hangs on a lock request:

14/12/22 14:41:57 DEBUG ClientCnxn: Reading reply sessionid:0x34a6121e6d93e74, packet:: clientPath:null serverPath:null finished:false header:: 289,8 replyHeader:: 289,51540866762,0 request:: '/hive_zookeeper_namespace_hive1/default,F response:: v{'sample_07,'LOCK-EXCLUSIVE-0001565612,'LOCK-EXCLUSIVE-0001565957}
14/12/22 14:41:57 ERROR ZooKeeperHiveLockManager: conflicting lock present for default mode EXCLUSIVE

Is it a compatibility issue because Spark SQL 1.2 is based on Hive 0.13? Is there a workaround other than upgrading CDH or forbidding UDFs on Spark SQL? Thanks. -- Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
what is the default log4j configuration passed to yarn container
Hi, In the case of an MR task, the log4j configuration and the container log folder are explicitly set in the container launch context by org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties, i.e. from the MapReduce YARN client code rather than from YARN itself. This is also visible from jinfo on the MR container's pid:

-Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=container log dir/application_1413959638984_0032/container_1413959638984_0032_01_01 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m

But in the case of Spark we don't see log4j.configuration=container-log4j.properties being set by default:

-XX:OnOutOfMemoryError=kill %p -Xms512m -Xmx512m -Djava.io.tmpdir=local dir/usercache/root/appcache/application_1413959638984_0027/container_1413959638984_0027_01_03/tmp -Dlog4j.configuration.watch=false -Dspark.akka.timeout=100 -Dspark.akka.frameSize=10 -Dspark.akka.heartbeat.pauses=600 -Dspark.akka.threads=4 -Dspark.yarn.app.container.log.dir=container log dir/application_1413959638984_0027/container_1413959638984_0027_01_03

If the user doesn't set a custom log4j configuration, how are the default log4j settings for a Spark container applied? Is a log4j configuration file placed on the classpath/jars, or is the log4j appender configured programmatically? Since the container log folder is different for each container, a static log4j.properties cannot hard-code it, so spark.yarn.app.container.log.dir is presumably referenced from log4j.properties, but it is not clear who sets it or where. Can anyone give more information on this? Regards, Ramana - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
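For reference: as far as I can tell, when no user configuration is supplied, the executor falls back to the org/apache/spark/log4j-defaults.properties bundled inside the Spark assembly jar (picked up programmatically by Spark's Logging trait, which prints "Using Spark's default log4j profile"), so nothing shows up on the container command line. A minimal sketch of shipping a custom configuration instead, assuming spark-submit on YARN; the file name log4j-executor.properties and the flags shown are examples, not the only way to do it:

  # log4j-executor.properties -- writes to the per-container log dir that Spark
  # exposes to the container JVM as the system property spark.yarn.app.container.log.dir
  log4j.rootCategory=INFO, file
  log4j.appender.file=org.apache.log4j.FileAppender
  log4j.appender.file.File=${spark.yarn.app.container.log.dir}/custom.log
  log4j.appender.file.layout=org.apache.log4j.PatternLayout
  log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

  spark-submit --master yarn-cluster \
    --files log4j-executor.properties \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
    ...

This is only a sketch; the exact flags depend on the Spark version in use.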
Re: Spark SQL 1.2 with CDH 4, Hive UDF is not working.
Hi Ji, Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive API/protocol compatibility issues. When interacting with Hive 0.11.x, connections and simple queries may succeed, but things may go crazy in unexpected corners (like UDF). Cheng On 12/22/14 4:15 PM, Ji ZHANG wrote: Hi, Recently I'm migrating from Shark 0.9 to Spark SQL 1.2, my CDH version is 4.5, Hive 0.11. I've managed to setup Spark SQL Thriftserver, and normal queries work fine, but custom UDF is not usable. The symptom is when executing CREATE TEMPORARY FUNCTION, the query hangs on a lock request: 14/12/22 14:41:57 DEBUG ClientCnxn: Reading reply sessionid:0x34a6121e6d93e74, packet:: clientPath:null serverPath:null finished:false header:: 289,8 replyHeader:: 289,51540866762,0 request:: '/hive_zookeeper_namespace_hive1/default,F response:: v{'sample_07,'LOCK-EXCLUSIVE-0001565612,'LOCK-EXCLUSIVE-0001565957} 14/12/22 14:41:57 ERROR ZooKeeperHiveLockManager: conflicting lock present for default mode EXCLUSIVE Is it a compatibility issue because Spark SQL 1.2 is based on Hive 0.13? Is there a workaround instead of upgrading CDH or forbidding UDF on Spark SQL? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
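The ZooKeeper lock in the log comes from Hive's concurrency support (hive.support.concurrency), which is what engages ZooKeeperHiveLockManager. If you want to experiment before upgrading, one untested workaround (an assumption on my part, and no substitute for matching Hive versions) is to disable metastore locking in the hive-site.xml that the Spark SQL Thrift server reads:

  <!-- hive-site.xml seen by the Spark SQL Thrift server; disables ZooKeeper-based locking,
       which is only safe if nothing else relies on those locks -->
  <property>
    <name>hive.support.concurrency</name>
    <value>false</value>
  </property>

This may get CREATE TEMPORARY FUNCTION past the lock request, but it does not address the underlying Hive 0.11 vs 0.13 protocol mismatch Cheng describes.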
Graceful shutdown in spark streaming
Hello all, I have a Spark Streaming application running in a standalone cluster (deployed with spark-submit --deploy-mode cluster). I am trying to add graceful shutdown functionality to this application but I am not sure what the best practice for this is. Currently I am using this code:

  sys.addShutdownHook {
    log.info("Shutdown requested. Graceful shutdown started.")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    log.info("Shutdown Complete. Bye")
  }
  ssc.start()
  ssc.awaitTermination()

It seems to be working but I still get some errors in the driver log, and the master UI shows "failed" as the status for the driver after it is stopped.

Driver log:

14/12/22 16:42:30 INFO Main: Shutdown requested. Graceful shutdown started.
...
14/12/22 16:42:40 INFO JobGenerator: Stopping JobGenerator gracefully.
...
14/12/22 16:42:50 INFO DAGScheduler: Job 1 failed: start at Main.scala:114, took 93.444222 s
Exception in thread "Thread-34" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
  at org.apache.spark.scheduler.DAGScheduler$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
  at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
  at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
  at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$finishTerminate(FaultHandling.scala:210)
  at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
  at akka.actor.ActorCell.terminate(ActorCell.scala:369)
  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
  at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
  at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/12/22 16:42:51 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/12/22 16:42:51 INFO MemoryStore: MemoryStore cleared
14/12/22 16:42:51 INFO BlockManager: BlockManager stopped
14/12/22 16:42:51 INFO BlockManagerMaster: BlockManagerMaster stopped
14/12/22 16:42:51 INFO SparkContext: Successfully stopped SparkContext
14/12/22 16:42:51 INFO Main: Shutdown Complete. Bye
(end of driver log)

Does anyone have experience to share regarding graceful shutdown in production for Spark Streaming? Thanks! Best Regards, Jesper Lundgren
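One pattern that avoids stopping the StreamingContext from inside a JVM shutdown hook (where the SparkContext may already be tearing down underneath it) is to trigger the graceful stop from a separate monitor thread, for example on an external marker file, and only then let the driver exit normally. A rough sketch, assuming log is your application logger and that the marker path and poll interval are arbitrary choices:

  import java.nio.file.{Files, Paths}

  ssc.start()

  // Monitor thread: request a graceful stop when the marker file appears.
  val stopFlag = Paths.get("/tmp/stop-streaming-app")
  new Thread("graceful-stop-monitor") {
    setDaemon(true)
    override def run(): Unit = {
      while (!Files.exists(stopFlag)) Thread.sleep(5000)
      log.info("Stop requested, shutting down gracefully")
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }.start()

  ssc.awaitTermination()   // returns once the graceful stop has completed
  log.info("Shutdown complete")

Whether the master UI then reports the driver as failed generally depends on the driver process exit code, so returning normally from main after awaitTermination should help.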
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Did you try running PageRank.scala instead of LiveJournalPageRank.scala? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20808.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to get list of edges between two Vertex ?
Do you need the multiple edges or can you get the work done by having single edge between two vertices? In my view point, you can group the edges using groupEdges which will group the same edges together. It may work because the message passed between the vertices through same edges (replicated) will not be different. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-list-of-edges-between-two-Vertex-tp19309p20809.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
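A minimal sketch of both options (merging the parallel edges versus simply listing them), assuming the edge attribute can be summed and that srcVid and dstVid are the two vertex ids of interest (both hypothetical names here):

  import org.apache.spark.graphx._

  // Option 1: collapse parallel edges into one by merging their attributes.
  // groupEdges only merges co-located edges, so repartition the graph first.
  val merged = graph
    .partitionBy(PartitionStrategy.EdgePartition2D)
    .groupEdges((a, b) => a + b)

  // Option 2: keep the multi-edges and just list the ones between two given vertices.
  val between = graph.edges
    .filter(e => e.srcId == srcVid && e.dstId == dstVid)
    .map(_.attr)
    .collect()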
Possible problems in packaging mlllib
I am trying to run the twitter classifier https://github.com/databricks/reference-apps A NoClasssDefFoundError pops up. I've checked the library that the HashingTF class file is there. Some stack overflow questions show that might be problem with packaging the class. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/mllib/feature/HashingTF at com.databricks.apps.twitter_classifier.Utils$.init(Utils.scala:12) at com.databricks.apps.twitter_classifier.Utils$.clinit(Utils.scala) at com.databricks.apps.twitter_classifier.Collect$.main(Collect.scala:26) at com.databricks.apps.twitter_classifier.Collect.main(Collect.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.lang.ClassNotFoundException: org.apache.spark.mllib.feature.HashingTF at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 9 more -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-problems-in-packaging-mlllib-tp20810.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Possible problems in packaging mlllib
Are you using an old version of Spark? I think this appeared in 1.1. You don't usually package this class or MLlib, so your packaging probably is not relevant, but it has to be available at runtime on your cluster then. On Mon, Dec 22, 2014 at 10:16 AM, shkesar shubhamke...@live.com wrote: I am trying to run the twitter classifier https://github.com/databricks/reference-apps A NoClasssDefFoundError pops up. I've checked the library that the HashingTF class file is there. Some stack overflow questions show that might be problem with packaging the class. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/mllib/feature/HashingTF at com.databricks.apps.twitter_classifier.Utils$.init(Utils.scala:12) at com.databricks.apps.twitter_classifier.Utils$.clinit(Utils.scala) at com.databricks.apps.twitter_classifier.Collect$.main(Collect.scala:26) at com.databricks.apps.twitter_classifier.Collect.main(Collect.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.lang.ClassNotFoundException: org.apache.spark.mllib.feature.HashingTF at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 9 more -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-problems-in-packaging-mlllib-tp20810.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
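For reference, a build.sbt sketch that makes MLlib available at compile time (the version number is an assumption; match it to your cluster). Note that the stack trace shows the app being launched from IntelliJ (AppMain), where "provided" dependencies are typically not on the run classpath, so for local runs either drop the provided scope or add the Spark jars to the run configuration:

  // build.sbt (sketch)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % "1.2.0" % "provided",
    "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
  )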
Using more cores on machines
Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead, leaving the other two machines untouched. Is this possible? Is there a downside to doing this? My thinking is that I should be able to reduce quite a bit of network traffic if all machines are not involved. Thanks, Ashic.
Re: Using more cores on machines
I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead, leaving the other two machines untouched. Is this possible? Is there a downside to doing this? My thinking is that I should be able to reduce quite a bit of network traffic if all machines are not involved. Thanks, Ashic. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Using more cores on machines
Hi Sean, Thanks for the response. It seems --num-executors is ignored. Specifying --num-executors 2 --executor-cores 2 is giving the app all 8 cores across 4 machines. -Ashic. From: so...@cloudera.com Date: Mon, 22 Dec 2014 10:57:31 + Subject: Re: Using more cores on machines To: as...@live.com CC: user@spark.apache.org I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead, leaving the other two machines untouched. Is this possible? Is there a downside to doing this? My thinking is that I should be able to reduce quite a bit of network traffic if all machines are not involved. Thanks, Ashic. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: java.sql.SQLException: No suitable driver found
Here is a script I use to submit a directory of jar files. It assumes jar files are in target/dependency or lib/:

  DRIVER_PATH=""
  DEPEND_PATH=""
  if [ -d lib ]; then
    DRIVER_PATH=lib
    DEPEND_PATH=lib
  else
    DRIVER_PATH=target
    DEPEND_PATH=target/dependency
  fi

  DEPEND_JARS=log4j.properties
  for f in `ls $DEPEND_PATH`; do
    DEPEND_JARS=$DEPEND_JARS,$DEPEND_PATH/$f
  done

  $SPARK_HOME/bin/spark-submit \
    --class $1 \
    --master yarn-client \
    --num-executors 1 \
    --driver-memory 4g \
    --executor-memory 16g \
    --executor-cores 4 \
    --jars $DEPEND_JARS \
    $DRIVER_PATH/core-ingest-*.jar ${*:2}

You would run it with a command like:

  ./run.sh class.to.submit arg1 arg2 …

On Dec 22, 2014, at 1:11 AM, durga durgak...@gmail.com wrote: One more question. How would I submit additional jars to the spark-submit job. I used --jars option, it seems it is not working as explained earlier. Thanks for the help, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-sql-SQLException-No-suitable-driver-found-tp20792p20805.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: locality sensitive hashing for spark
The implementation closely aligns with jaccard. It should be possible to swap out the hash functions to a family that is compatible with other distance measures. On Dec 22, 2014, at 1:16 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Looks interesting thanks for sharing. Does it support cosine similarity ? I only saw jaccard mentioned from a quick glance. — Sent from Mailbox https://www.dropbox.com/mailbox On Mon, Dec 22, 2014 at 4:12 AM, morr0723 michael.d@gmail.com mailto:michael.d@gmail.com wrote: I've pushed out an implementation of locality sensitive hashing for spark. LSH has a number of use cases, most prominent being if the features are not based in Euclidean space. Code, documentation, and small exemplar dataset is available on github: https://github.com/mrsqueeze/spark-hash Feel free to pass along any comments or issues. Enjoy! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/locality-sensitive-hashing-for-spark-tp20803.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
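For anyone curious what such a swap could look like: sign-random-projection (random hyperplane) hashing is the standard LSH family for cosine similarity. A self-contained sketch, not part of the spark-hash package itself:

  import scala.util.Random

  // One random hyperplane per signature bit (keep numBits <= 32 to fit in an Int).
  def randomHyperplanes(dim: Int, numBits: Int, seed: Long): Array[Array[Double]] = {
    val rnd = new Random(seed)
    Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
  }

  // Each bit is the sign of the dot product with one hyperplane;
  // vectors with small cosine distance tend to agree on most bits.
  def signature(v: Array[Double], planes: Array[Array[Double]]): Int =
    planes.zipWithIndex.foldLeft(0) { case (sig, (plane, i)) =>
      val dot = plane.zip(v).map { case (p, x) => p * x }.sum
      if (dot >= 0) sig | (1 << i) else sig
    }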
Re: S3 files , Spark job hungsup
Is it possible too many connections open to read from s3 from one node? I have this issue before because I open a few hundreds of files on s3 to read from one node. It just block itself without error until timeout later. On Monday, December 22, 2014, durga durgak...@gmail.com wrote: Hi All, I am facing a strange issue sporadically. occasionally my spark job is hungup on reading s3 files. It is not throwing exception . or making some progress, it is just hungs up there. Is this a known issue , Please let me know how could I solve this issue. Thanks, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org javascript:; For additional commands, e-mail: user-h...@spark.apache.org javascript:;
Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
Hello everyone! Like the title. I start the Spark SQL 1.2.0 thrift server. Use beeline connect to the server to execute SQL. I want to kill one SQL job running in the thrift server and not kill the thrift server. I set property spark.ui.killEnabled=true in spark-default.conf But in the UI, only stages can be killed, and the job can’t be killed! Is any way to kill the SQL job in the thrift server? Xiaoyu Wang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Fetch Failure
Which version of spark are you running? It could be related to this https://issues.apache.org/jira/browse/SPARK-3633 fixed in 1.1.1 and 1.2.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Effects problems in logistic regression
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me! De: Franco Barrientos [mailto:franco.barrien...@exalitica.com] Enviado el: jueves, 18 de diciembre de 2014 16:42 Para: 'DB Tsai' CC: 'Sean Owen'; user@spark.apache.org Asunto: RE: Effects problems in logistic regression Thanks I will try. De: DB Tsai [mailto:dbt...@dbtsai.com] Enviado el: jueves, 18 de diciembre de 2014 16:24 Para: Franco Barrientos CC: Sean Owen; user@spark.apache.org mailto:user@spark.apache.org Asunto: Re: Effects problems in logistic regression Can you try LogisticRegressionWithLBFGS? I verified that this will be converged to the same result trained by R's glmnet package without regularization. The problem of LogisticRegressionWithSGD is it's very slow in term of converging, and lots of time, it's very sensitive to stepsize which can lead to wrong answer. The regularization logic in MLLib is not entirely correct, and it will penalize the intercept. In general, with really high regularization, all the coefficients will be zeros except the intercept. In logistic regression, the non-zero intercept can be understood as the prior-probability of each class, and in linear regression, this will be the mean of response. I'll have a PR to fix this issue. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos franco.barrien...@exalitica.com mailto:franco.barrien...@exalitica.com wrote: Yes, without the “amounts” variables the results are similiar. When I put other variables its fine. De: Sean Owen [mailto:so...@cloudera.com mailto:so...@cloudera.com ] Enviado el: jueves, 18 de diciembre de 2014 14:22 Para: Franco Barrientos CC: user@spark.apache.org mailto:user@spark.apache.org Asunto: Re: Effects problems in logistic regression Are you sure this is an apples-to-apples comparison? for example does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark. On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos franco.barrien...@exalitica.com wrote: Hi all!, I have a problem with LogisticRegressionWithSGD, when I train a data set with one variable (wich is a amount of an item) and intercept, I get weights of (-0.4021,-207.1749) for both features, respectively. This don´t make sense to me because I run a logistic regression for the same data in SAS and I get these weights (-2.6604,0.000245). The rank of this variable is from 0 to 59102 with a mean of 1158. The problem is when I want to calculate the probabilities for each user from data set, this probability is near to zero or zero in much cases, because when spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) this is a big number, in fact infinity for spark. How can I treat this variable? or why this happened? Thanks , Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 tel:%28%2B562%29-29699649 (+569)-76347893 tel:%28%2B569%29-76347893 franco.barrien...@exalitica.com mailto:franco.barrien...@exalitica.com www.exalitica.com http://www.exalitica.com/ http://exalitica.com/web/img/frim.png
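For anyone landing on this thread later, a minimal sketch of the LBFGS-based trainer that worked here; the input path is a placeholder and the data is assumed to be in LIBSVM format:

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  val training = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/train.libsvm")

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(training)

  // model.intercept and model.weights hold the fitted coefficients.
  // clearThreshold() makes predict() return probabilities instead of 0/1 labels.
  model.clearThreshold()
  val scored = training.map(p => (p.label, model.predict(p.features)))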
Spark exception when sending message to akka actor
Hi All, I have akka remote actors running on 2 nodes. I submitted a spark application from node1. In the spark code, in one of the RDDs, I am sending a message to the actor running on node1. My Spark code is as follows:

  class ActorClient extends Actor with Serializable {
    import context._
    val currentActor: ActorSelection =
      context.system.actorSelection("akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")
    implicit val timeout = Timeout(10 seconds)

    def receive = {
      case msg: String => {
        if (msg.contains("Spark")) {
          currentActor ! msg
          sender ! "Local"
        } else {
          println("Received.." + msg)
          val future = currentActor ? msg
          val result = Await.result(future, timeout.duration).asInstanceOf[String]
          if (result.contains("ACK"))
            sender ! "OK"
        }
      }
      case PoisonPill => context.stop(self)
    }
  }

  object SparkExec extends Serializable {
    implicit val timeout = Timeout(10 seconds)
    val actorSystem = ActorSystem("ClientActorSystem")
    val actor = actorSystem.actorOf(Props(classOf[ActorClient]), name = "ClientActor")

    def main(args: Array[String]) = {
      val conf = new SparkConf().setAppName("DeepLearningSpark")
      val sc = new SparkContext(conf)
      val textrdd = sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")

      val rdd1 = textrdd.map { line =>
        println("In Map...")
        val future = actor ? "Hello..Spark"
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("Local")) {
          println("Recieved in map" + result)
          //actorSystem.shutdown
        }
        (10)
      }

      val rdd2 = rdd1.map { x =>
        val future = actor ? "Done"
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("OK")) {
          actorSystem.stop(remoteActor)
          actorSystem.shutdown
        }
        (2)
      }

      rdd2.saveAsTextFile("/home/padma/SparkAkkaOut")
    }
  }

In my ActorClient actor I identify the remote actor through actorSelection and send it the message. Once the messages are sent, in rdd2, after receiving the ack from the remote actor, I kill the ActorClient actor and shut down the ActorSystem.
The above code is throwing the following exception: 14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.ExceptionInInitializerError: com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166) com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) java.lang.Thread.run(Thread.java:722) 14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in): java.lang.NoClassDefFoundError: Could not initialize class com.impetus.spark.SparkExec$ com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166) com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
Re: Effects problems in logistic regression
Sounds great. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me! *De:* Franco Barrientos [mailto:franco.barrien...@exalitica.com] *Enviado el:* jueves, 18 de diciembre de 2014 16:42 *Para:* 'DB Tsai' *CC:* 'Sean Owen'; user@spark.apache.org *Asunto:* RE: Effects problems in logistic regression Thanks I will try. *De:* DB Tsai [mailto:dbt...@dbtsai.com dbt...@dbtsai.com] *Enviado el:* jueves, 18 de diciembre de 2014 16:24 *Para:* Franco Barrientos *CC:* Sean Owen; user@spark.apache.org *Asunto:* Re: Effects problems in logistic regression Can you try LogisticRegressionWithLBFGS? I verified that this will be converged to the same result trained by R's glmnet package without regularization. The problem of LogisticRegressionWithSGD is it's very slow in term of converging, and lots of time, it's very sensitive to stepsize which can lead to wrong answer. The regularization logic in MLLib is not entirely correct, and it will penalize the intercept. In general, with really high regularization, all the coefficients will be zeros except the intercept. In logistic regression, the non-zero intercept can be understood as the prior-probability of each class, and in linear regression, this will be the mean of response. I'll have a PR to fix this issue. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: Yes, without the “amounts” variables the results are similiar. When I put other variables its fine. *De:* Sean Owen [mailto:so...@cloudera.com] *Enviado el:* jueves, 18 de diciembre de 2014 14:22 *Para:* Franco Barrientos *CC:* user@spark.apache.org *Asunto:* Re: Effects problems in logistic regression Are you sure this is an apples-to-apples comparison? for example does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark. On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos franco.barrien...@exalitica.com wrote: Hi all!, I have a problem with LogisticRegressionWithSGD, when I train a data set with one variable (wich is a amount of an item) and intercept, I get weights of (-0.4021,-207.1749) for both features, respectively. This don´t make sense to me because I run a logistic regression for the same data in SAS and I get these weights (-2.6604,0.000245). The rank of this variable is from 0 to 59102 with a mean of 1158. The problem is when I want to calculate the probabilities for each user from data set, this probability is near to zero or zero in much cases, because when spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) this is a big number, in fact infinity for spark. How can I treat this variable? or why this happened? Thanks , *Franco Barrientos* Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 franco.barrien...@exalitica.com www.exalitica.com [image: http://exalitica.com/web/img/frim.png]
Tuning Spark Streaming jobs
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune the jobs in order to maximize throughput. We have summarized our findings in a blog post with the intention of collecting feedback and hoping that it is useful to other Spark Streaming users facing similar issues. http://www.virdata.com/tuning-spark/ Your feedback is welcome. With kind regards, Gerard. Data Processing Team Lead Virdata.com @maasg
RE: Using more cores on machines
Hi Josh, I'm not looking to change the 1:1 ratio. What I'm trying to do is get both cores on two machines working, rather than one core on all four machines. With --total-executor-cores 4, I have 1 core per machine working for an app. I'm looking for something that'll let me use 2 cores per machine on 2 machines (so 4 cores in total) while not using the other two machines. Regards, Ashic. From: j...@soundcloud.com Date: Mon, 22 Dec 2014 17:36:26 +0100 Subject: Re: Using more cores on machines To: as...@live.com CC: so...@cloudera.com; user@spark.apache.org AFAIK, `--num-executors` is not available for standalone clusters. In standalone mode, you must start new workers on your node as it is a 1:1 ratio of workers to executors. On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote: Hi Sean, Thanks for the response. It seems --num-executors is ignored. Specifying --num-executors 2 --executor-cores 2 is giving the app all 8 cores across 4 machines. -Ashic. From: so...@cloudera.com Date: Mon, 22 Dec 2014 10:57:31 + Subject: Re: Using more cores on machines To: as...@live.com CC: user@spark.apache.org I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead, leaving the other two machines untouched. Is this possible? Is there a downside to doing this? My thinking is that I should be able to reduce quite a bit of network traffic if all machines are not involved. Thanks, Ashic. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLlib, classification label problem
Yeah, it's mentioned in the doc: Note that, in the mathematical formulation in this guide, a training label y is denoted as either +1 (positive) or −1 (negative), which is convenient for the formulation. However, the negative label is represented by 0 in MLlib instead of −1, to be consistent with multiclass labeling. Both are valid and equally correct, although the two conventions lead to different expressions for the gradients and loss functions. I also find it is a little confusing since the docs explain one form, and the code implements another form (except some examples, which actually reimplement with the -1/+1 convention). I personally am also more used to the forms corresponding to 0 for the negative class, but I'm sure some will say they're more accustomed to the other convention. On Mon, Dec 22, 2014 at 4:02 PM, Hao Ren inv...@gmail.com wrote: Hi, When going through the MLlib doc for classification: http://spark.apache.org/docs/latest/mllib-linear-methods.html, I find that the loss functions are based on label {1, -1}. But in MLlib, the loss functions on label {1, 0} are used. And there is a dataValidation check before fitting, if a label is other than 0 or 1, an exception will be thrown. I don't understand the intention here. Could someone explain this ? Hao. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-classification-label-problem-tp20813.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
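For reference, the two conventions give the following (equivalent) logistic loss expressions, related by substituting y' = 2y - 1:

  % convention used in the guide's formulas, y \in \{+1, -1\}
  L(\mathbf{w}; \mathbf{x}, y) = \log\left(1 + \exp(-y\, \mathbf{w}^{T}\mathbf{x})\right)

  % convention expected by the MLlib API, y \in \{0, 1\}, with \sigma(z) = 1/(1 + e^{-z})
  L(\mathbf{w}; \mathbf{x}, y) = -\left[\, y \log \sigma(\mathbf{w}^{T}\mathbf{x}) + (1 - y)\log\left(1 - \sigma(\mathbf{w}^{T}\mathbf{x})\right) \right]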
Re: Using more cores on machines
If you are looking to reduce network traffic then setting spark.deploy.spreadOut to false may help. On Mon, Dec 22, 2014 at 11:44 AM, Ashic Mahtab as...@live.com wrote: Hi Josh, I'm not looking to change the 1:1 ratio. What I'm trying to do is get both cores on two machines working, rather than one core on all four machines. With --total-executor-cores 4, I have 1 core per machine working for an app. I'm looking for something that'll let me use 2 cores per machine on 2 machines (so 4 cores in total) while not using the other two machines. Regards, Ashic. From: j...@soundcloud.com Date: Mon, 22 Dec 2014 17:36:26 +0100 Subject: Re: Using more cores on machines To: as...@live.com CC: so...@cloudera.com; user@spark.apache.org AFAIK, `--num-executors` is not available for standalone clusters. In standalone mode, you must start new workers on your node as it is a 1:1 ratio of workers to executors. On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote: Hi Sean, Thanks for the response. It seems --num-executors is ignored. Specifying --num-executors 2 --executor-cores 2 is giving the app all 8 cores across 4 machines. -Ashic. From: so...@cloudera.com Date: Mon, 22 Dec 2014 10:57:31 + Subject: Re: Using more cores on machines To: as...@live.com CC: user@spark.apache.org I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead, leaving the other two machines untouched. Is this possible? Is there a downside to doing this? My thinking is that I should be able to reduce quite a bit of network traffic if all machines are not involved. Thanks, Ashic. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
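For completeness, a sketch of that setup, assuming a standalone cluster. spark.deploy.spreadOut is read by the master (so it needs a master restart) and applies to every application that master schedules, not just this one:

  # conf/spark-env.sh on the master host, then restart the master
  SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"

  # submit as before, capping the total cores for the app
  spark-submit --master spark://<master-host>:7077 --total-executor-cores 4 ...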
Re: S3 files , Spark job hungsup
Yes . I am reading thousands of files every hours. Is there any way I can tell spark to timeout. Thanks for your help. -D On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com wrote: Is it possible too many connections open to read from s3 from one node? I have this issue before because I open a few hundreds of files on s3 to read from one node. It just block itself without error until timeout later. On Monday, December 22, 2014, durga durgak...@gmail.com wrote: Hi All, I am facing a strange issue sporadically. occasionally my spark job is hungup on reading s3 files. It is not throwing exception . or making some progress, it is just hungs up there. Is this a known issue , Please let me know how could I solve this issue. Thanks, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
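A couple of knobs that may be worth trying, sketched below. The fs.s3a properties assume your Hadoop build ships the newer s3a connector (Hadoop 2.6+); with the older s3n/jets3t connector the equivalent timeouts live in a jets3t.properties file instead. Speculative execution is a Spark-side way to re-launch tasks that hang on a slow or stuck S3 read:

  import org.apache.spark.{SparkConf, SparkContext}

  // Spark side: re-launch tasks that fall far behind their siblings.
  val conf = new SparkConf().set("spark.speculation", "true")
  val sc = new SparkContext(conf)

  // Hadoop/s3a side (assumption: s3a connector available): cap connections and time out.
  sc.hadoopConfiguration.set("fs.s3a.connection.maximum", "100")
  sc.hadoopConfiguration.set("fs.s3a.connection.timeout", "60000")   // milliseconds
  sc.hadoopConfiguration.set("fs.s3a.attempts.maximum", "10")

  val lines = sc.textFile("s3a://my-bucket/path/*")   // bucket and path are placeholders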
Re: custom python converter from HBase Result to tuple
Which HBase version are you using ? Can you show the full stack trace ? Cheers On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, can anyone please give me some help on how to write a custom converter of hbase data to (for example) tuples of ((family, qualifier, value), ) for pyspark? I was trying something like this (here trying tuples of (family:qualifier:value, )):

  class HBaseResultToTupleConverter extends Converter[Any, List[String]] {
    override def convert(obj: Any): List[String] = {
      val result = obj.asInstanceOf[Result]
      result.rawCells().map(cell =>
        List(Bytes.toString(CellUtil.cloneFamily(cell)),
             Bytes.toString(CellUtil.cloneQualifier(cell)),
             Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")
      ).toList
    }
  }

but then I get an error:

14/12/22 16:27:40 WARN python.SerDeUtil: Failed to pickle Java object as value: $colon$colon, falling back to 'toString'. Error: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments

does anyone have a hint? Thanks, Antony. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Tuning Spark Streaming jobs
Hi Tim, That would be awesome. We have seen some really disparate Mesos allocations for our Spark Streaming jobs. (like (7,4,1) over 3 executors for 4 kafka consumer instead of the ideal (3,3,3,3)) For network dependent consumers, achieving an even deployment would provide a reliable and reproducible streaming job execution from the performance point of view. We're deploying in coarse grain mode. Not sure Spark Streaming would work well in fine-grained given the added latency to acquire a worker. You mention that you're changing the Mesos scheduler. Is there a Jira where this job is taking place? -kr, Gerard. On Mon, Dec 22, 2014 at 6:01 PM, Timothy Chen tnac...@gmail.com wrote: Hi Gerard, Really nice guide! I'm particularly interested in the Mesos scheduling side to more evenly distribute cores across cluster. I wonder if you are using coarse grain mode or fine grain mode? I'm making changes to the spark mesos scheduler and I think we can propose a best way to achieve what you mentioned. Tim Sent from my iPhone On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune the jobs in order to maximize throughput. We have summarized our findings in a blog post with the intention of collecting feedback and hoping that it is useful to other Spark Streaming users facing similar issues. http://www.virdata.com/tuning-spark/ Your feedback is welcome. With kind regards, Gerard. Data Processing Team Lead Virdata.com @maasg
Announcing Spark Packages
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: does spark sql support columnar compression with encoding when caching tables
Thanks Cheng, Michael - that was super helpful. On Sun, Dec 21, 2014 at 7:27 AM, Cheng Lian lian.cs@gmail.com wrote: Would like to add that compression schemes built in in-memory columnar storage only supports primitive columns (int, string, etc.), complex types like array, map and struct are not supported. On 12/20/14 6:17 AM, Sadhan Sood wrote: Hey Michael, Thank you for clarifying that. Is tachyon the right way to get compressed data in memory or should we explore the option of adding compression to cached data. This is because our uncompressed data set is too big to fit in memory right now. I see the benefit of tachyon not just with storing compressed data in memory but we wouldn't have to create a separate table for caching some partitions like 'cache table table_cached as select * from table where date = 201412XX' - the way we are doing right now. On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com wrote: There is only column level encoding (run length encoding, delta encoding, dictionary encoding) and no generic compression. On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Hi All, Wondering if when caching a table backed by lzo compressed parquet data, if spark also compresses it (using lzo/gzip/snappy) along with column level encoding or just does the column level encoding when *spark.sql.inMemoryColumnarStorage.compressed *is set to true. This is because when I try to cache the data, I notice the memory being used is almost as much as the uncompressed size of the data. Thanks!
Re: UNION two RDDs
Hi Sean and Madhu, Thank you for the explanation. I really appreciate it. Best Regards, Jerry On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote: coalesce actually changes the number of partitions. Unless the original RDD had just 1 partition, coalesce(1) will make an RDD with 1 partition that is larger than the original partitions, of course. I don't think the question is about ordering of things within an element of the RDD? If the original RDD was sorted, and so has a defined ordering, then it will be preserved. Otherwise I believe you do not have any guarantees about ordering. In practice, you may find that you still encounter the elements in the same order after coalesce(1), although I am not sure that is even true. union() is the same story; unless the RDDs are sorted I don't think there are guarantees. However I'm almost certain that in practice, as it happens now, A's elements would come before B's after a union, if you did traverse them. On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com wrote: Hi, coalesce is an operation which changes no of records in a partition. It will not touch ordering with in a row AFAIK. On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
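A tiny illustration of the behaviour discussed above (relying on it is discouraged, since ordering is only guaranteed for sorted RDDs):

  val a = sc.parallelize(Seq(1, 2, 3), 2)
  val b = sc.parallelize(Seq(4, 5, 6), 2)

  val u = a.union(b)          // the partitions of a are followed by the partitions of b
  u.coalesce(1).collect()     // in practice often Array(1, 2, 3, 4, 5, 6), but ordering is not a contract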
MLLib beginner question
Hi! I want to try out Spark MLlib in my Spark project, but I have a little problem. I have training data (an external file), but the real data comes from another RDD. How can I do that? I tried simply using the same SparkContext for both: first I create an RDD using sc.textFile() and then call NaiveBayes.train... After that I want to fetch the real data using the same context and call predict inside a map. But my application never exits (I think it is stuck or something). Why doesn't this solution work? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com
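A rough sketch of that flow, with the caveat that the training-file format, paths, otherRdd and toFeatureVector are all placeholders for your own code:

  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Training data from the external file; the "label,f1 f2 f3 ..." format is only an assumption.
  val training = sc.textFile("hdfs:///path/train.csv").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }
  val model = NaiveBayes.train(training, 1.0)

  // The "real" data from another RDD in the same SparkContext.
  val predictions = otherRdd.map(record => model.predict(toFeatureVector(record)))
  predictions.saveAsTextFile("hdfs:///path/predictions")   // an action forces the job to run

  sc.stop()   // without stopping the context the driver JVM may keep running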
Re: spark-repl_1.2.0 was not uploaded to central maven repository.
Just closing the loop -- FWIW this was indeed on purpose -- https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's not encouraged to depend on the REPL as a module. On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote: I'm only speculating, but I wonder if it was on purpose? would people ever build an app against the REPL? On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng pc...@uow.edu.au wrote: Everything else is there except spark-repl. Can someone check that out this weekend? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Long-running job cleanup
Hi all, I have a long running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, the overhead of Spark shuffles eventually accumulates, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that allows me to configure when cleanup happens, but the issue is that it isn't done safely; e.g. I can be in the middle of processing a stage when this cleanup happens and my cached RDDs get cleared. This ultimately causes a KeyNotFoundException when I try to reference the now-cleared cached RDD. This behavior doesn't make much sense to me; I would expect the cached RDD to either get regenerated or, at the very least, for there to be an option to execute this cleanup without deleting those RDDs. Is there a programmatically safe way of doing this cleanup that doesn't break everything? If I instead tear down the Spark context and bring up a new context for every iteration (assuming that each iteration is sufficiently long-lived), would memory get released appropriately? The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
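A safer pattern than the TTL-based cleaner is usually to manage the cached RDDs explicitly: materialize each iteration's result before unpersisting its parent, and checkpoint occasionally so the lineage (and the shuffle state it pins) can be dropped. A rough sketch, where initialRdd, step and numIterations stand in for your own code and sc.setCheckpointDir is assumed to have been called:

  var current = initialRdd.persist()
  for (i <- 1 to numIterations) {
    val next = step(current).persist()
    next.count()                       // force materialization before dropping the parent
    current.unpersist(blocking = false)
    current = next
    if (i % 10 == 0) {                 // periodically truncate the lineage
      current.checkpoint()
      current.count()
    }
  }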
Re: Spark in Standalone mode
Please check the Spark version and Hadoop version in your Maven build as well as in your local Spark setup. If the Hadoop versions don't match you might get this issue. Thanks, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-Standalone-mode-tp20780p20815.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark-repl_1.2.0 was not uploaded to central maven repository.
Thanks a lot for pointing it out. I also found it in pom.xml. A new ticket for reverting it has been submitted: https://issues.apache.org/jira/browse/SPARK-4923 At first I assumed that further development on it had been moved to Databricks Cloud, but the JIRA ticket was already there in September, so maybe demand for this API from the community is indeed low enough. However, I would still suggest keeping it, even promoting it to a Developer API; this would encourage more projects to integrate in a more flexible way, and save prototyping/QA cost by customizing REPL fixtures. People will still move to Databricks Cloud, which has far more features than that. Many influential projects already depend on the routinely published Scala REPL (e.g. the Play Framework), and it would be strange for Spark not to do the same. What do you think? Yours Peng On 12/22/2014 04:57 PM, Sean Owen wrote: Just closing the loop -- FWIW this was indeed on purpose -- https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's not encouraged to depend on the REPL as a module. On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote: I'm only speculating, but I wonder if it was on purpose? would people ever build an app against the REPL? On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng pc...@uow.edu.au wrote: Everything else is there except spark-repl. Can someone check that out this weekend? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Announcing Spark Packages
Me 2 :) On 12/22/2014 06:14 PM, Andrew Ash wrote: Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com mailto:men...@gmail.com wrote: Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org
Re: Announcing Spark Packages
Hello Xiangrui, If you have not already done so, you should look at http://www.apache.org/foundation/marks/#domains for the policy on use of ASF trademarked terms in domain names. thanks — Hitesh On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
I would expect that killing a stage would kill the whole job. Are you not seeing that happen? On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote: Hello everyone! Like the title. I start the Spark SQL 1.2.0 thrift server. Use beeline connect to the server to execute SQL. I want to kill one SQL job running in the thrift server and not kill the thrift server. I set property spark.ui.killEnabled=true in spark-default.conf But in the UI, only stages can be killed, and the job can’t be killed! Is any way to kill the SQL job in the thrift server? Xiaoyu Wang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Interpreting MLLib's linear regression o/p
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use LIBSVM format to specify my input feature vector, which used 1-based index. When I run regression the o/p is 0-indexed based. I have a master lookup file that maps back these indices to what they stand or. However, I need to add offset of 2 and not 1 to the regression outcome during the mapping. So for example to map the index of 800 from the regression output file, I look for 802 in my master lookup file and then things make sense. I can understand adding offset of 1, but not sure why adding offset 2 is working fine. Have others seem something like this as well? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib beginner question
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data com from another rdd. How can I do that? I try to simple using same SparkContext to boot rdd (first I create rdd using sc.textFile() and after NaiveBayes.train... After that I want to fetch the real data using same context and internal the map using the predict. But My application never exit (I think stucked or something). Why not work this solution? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Interpreting MLLib's linear regression o/p
Hi, It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... This was the confusing part in the documentation: the indices are one-based and in ascending order, and after loading, the feature indices are converted to zero-based. Let us say that I have 40 features, so I create an index file (feature, index number) like this: F1 1, F2 2, F3 3, ..., F40 40. I then create my feature vectors in the LIBSVM format, something like: "1 10:1 20:0 8:1 4:0 24:1", "1 1:1 40:0 2:1 8:0 9:1 23:1", "0 23:1 18:0 13:1". I run regression and get back model.weights, which are 40 weights, say 0.11, 0.3445, 0.5, ... In that case, does the first weight (0.11) correspond to index 1/F1 or to index 2/F2, since the input is 1-based and the output is 0-based? Or is the 0-based indexing only for internal representation, and what you get back at the end of regression is essentially 1-based indexed like your input, so that 0.11 maps onto F1 and so on? Date: Mon, 22 Dec 2014 16:31:57 -0800 Subject: Re: Interpreting MLLib's linear regression o/p From: men...@gmail.com To: ssti...@live.com CC: user@spark.apache.org Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use the LIBSVM format to specify my input feature vectors, which uses 1-based indices. When I run regression, the output is 0-based indexed. I have a master lookup file that maps these indices back to what they stand for. However, I need to add an offset of 2, not 1, to the regression outcome during the mapping. So, for example, to map index 800 from the regression output file, I look for 802 in my master lookup file and then things make sense. I can understand adding an offset of 1, but I am not sure why adding an offset of 2 works. Have others seen something like this as well? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
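A small sketch of one way to check the weight-to-feature mapping empirically; as far as I understand, loadLibSVMFile shifts the 1-based indices in the file to 0-based vector positions, so weights(0) would correspond to feature index 1 (F1), but treat that, and the mapping printed below, as an assumption to verify rather than an explanation of the offset-of-2 observation (the path is a placeholder):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/features.libsvm")
val model = LinearRegressionWithSGD.train(data, 100)

// print each weight next to the 1-based LIBSVM index it is presumed to map to
model.weights.toArray.zipWithIndex.foreach { case (w, i) =>
  println("weight " + w + " -> libsvm index " + (i + 1))
}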
Re: Announcing Spark Packages
Hitesh, From your link http://www.apache.org/foundation/marks/#domains: You may not use ASF trademarks such as “Apache” or “ApacheFoo” or “Foo” in your own domain names if that use would be likely to confuse a relevant consumer about the source of software or services provided through your website, without written approval of the VP, Apache Brand Management or designee. The title on the packages website is “A community index of packages for Apache Spark.” Furthermore, the footnote of the website reads “Spark Packages is a community site hosting modules that are not part of Apache Spark.” I think there’s nothing on there that would “confuse a relevant consumer about the source of software”. It’s pretty clear that the Spark Packages name is well within the ASF’s guidelines. Have I misunderstood the ASF’s policy? Nick On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote: Hello Xiangrui, If you have not already done so, you should look at http://www.apache.org/ foundation/marks/#domains for the policy on use of ASF trademarked terms in domain names. thanks — Hitesh On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Announcing Spark Packages
Okie doke! (I just assumed there was an issue since the policy was brought up.) On Mon Dec 22 2014 at 8:33:53 PM Patrick Wendell pwend...@gmail.com wrote: Hey Nick, I think Hitesh was just trying to be helpful and point out the policy - not necessarily saying there was an issue. We've taken a close look at this and I think we're in good shape her vis-a-vis this policy. - Patrick On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hitesh, From your link: You may not use ASF trademarks such as Apache or ApacheFoo or Foo in your own domain names if that use would be likely to confuse a relevant consumer about the source of software or services provided through your website, without written approval of the VP, Apache Brand Management or designee. The title on the packages website is A community index of packages for Apache Spark. Furthermore, the footnote of the website reads Spark Packages is a community site hosting modules that are not part of Apache Spark. I think there's nothing on there that would confuse a relevant consumer about the source of software. It's pretty clear that the Spark Packages name is well within the ASF's guidelines. Have I misunderstood the ASF's policy? Nick On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote: Hello Xiangrui, If you have not already done so, you should look at http://www.apache.org/foundation/marks/#domains for the policy on use of ASF trademarked terms in domain names. thanks -- Hitesh On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Dear Spark users and developers, I'm happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I'd like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark streaming python + kafka
There is a WIP pull request [1] working on this; it should be merged into master soon. [1] https://github.com/apache/spark/pull/3715 On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I've just seen that Spark Streaming supports Python as of version 1.2. Question: does Spark Streaming (the Python version) support Kafka integration? Thanks, Oleg. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Who manage the log4j appender while running spark on yarn?
After some discussions with the Hadoop guys, I figured out how the mechanism works. If we don't add -Dlog4j.configuration to the Java options of the container (AM or executors), it will use a log4j.properties (if any) found on the container's classpath (extraClassPath plus yarn.application.classpath). If we want to customize our log4j configuration, we should add spark.executor.extraJavaOptions=-Dlog4j.configuration=/path/to/log4j.properties or spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=/path/to/log4j.properties to the spark-defaults.conf file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Who-manage-the-log4j-appender-while-running-spark-on-yarn-tp20778p20818.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Who manage the log4j appender while running spark on yarn?
If you don't specify your own log4j.properties, Spark will load the default one (from core/src/main/resources/org/apache/spark/log4j-defaults.properties, which ends up being packaged with the Spark assembly). You can easily override the config file if you want to, though; check the Debugging section of the Running on YARN docs. On Fri, Dec 19, 2014 at 12:37 AM, WangTaoTheTonic barneystin...@aliyun.com wrote: Hi guys, I recently ran Spark on YARN and found that Spark didn't set any log4j properties file in configuration or code, and the log4j logs were being written to the stderr file under ${yarn.nodemanager.log-dirs}/application_${appid}. I want to know which side (Spark or Hadoop) controls the appender. I found a related discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logging-strategy-on-YARN-td8751.html, but I think the Spark code has changed a lot since then. Could anyone offer some guidance? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Who-manage-the-log4j-appender-while-running-spark-on-yarn-tp20778.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
broadcasting object issue
Hi All, I have a problem with broadcasting a serializable class object that is returned by another, non-serializable class. Here is the sample code:
class A extends java.io.Serializable {
  def halo(): String = "halo"
}
class B {
  def getA() = new A
}
val list = List(1)
val b = new B
val a = b.getA
val p = sc.parallelize(list)
// this will fail
val bcA = sc.broadcast(a)
p.map(x => { bcA.value.halo() })
// this will succeed
val bcA = sc.broadcast(new A)
p.map(x => { bcA.value.halo() })
A is a serializable class, whereas B is non-serializable. If I create a new object A through B's method getA(), the map process fails with the exception org.apache.spark.SparkException: Task not serializable, Caused by: java.io.NotSerializableException: $iwC$$iwC$B. I don't know why Spark checks whether the B class is serializable; is there a way to code around this? Best regards, Henry The privileged confidential information contained in this email is intended for use only by the addressees as indicated by the original sender of this email. If you are not the addressee indicated in this email or are not responsible for delivery of the email to such a person, please kindly reply to the sender indicating this fact and delete all copies of it from your computer and network server immediately. Your cooperation is highly appreciated. It is advised that any unauthorized use of confidential information of Winbond is strictly prohibited; and any information in this email irrelevant to the official business of Winbond shall be deemed as neither given nor endorsed by Winbond.
Re: custom python converter from HBase Result to tuple
using hbase 0.98.6. There is no stack trace, just this short error. I just noticed it does the fallback to toString as in the message, as this is what I get back in python: hbase_rdd.collect() [(u'key1', u'List(cf1:12345:14567890, cf2:123:14567896)')] So the question is why it falls back to toString? thanks, Antony. On Monday, 22 December 2014, 20:09, Ted Yu yuzhih...@gmail.com wrote: Which HBase version are you using? Can you show the full stack trace? Cheers On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, can anyone please give me some help on how to write a custom converter of HBase data to (for example) tuples of ((family, qualifier, value), ...) for pyspark? I was trying something like this (here trying tuples of (family:qualifier:value, ...)):
class HBaseResultToTupleConverter extends Converter[Any, List[String]] {
  override def convert(obj: Any): List[String] = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map(cell =>
      List(Bytes.toString(CellUtil.cloneFamily(cell)),
           Bytes.toString(CellUtil.cloneQualifier(cell)),
           Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")
    ).toList
  }
}
but then I get this error: 14/12/22 16:27:40 WARN python.SerDeUtil: Failed to pickle Java object as value: $colon$colon, falling back to 'toString'. Error: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments Does anyone have a hint? Thanks, Antony. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
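One guess, not a verified fix: the $colon$colon in the warning is Scala's List cons class, so the pickler is apparently being handed a Scala List it cannot introspect as a JavaBean and falls back to toString. Returning a plain Java collection from the converter might avoid the fallback; a hedged sketch (the class name below is made up):

import java.util
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

class HBaseResultToJavaListConverter extends Converter[Any, util.ArrayList[String]] {
  override def convert(obj: Any): util.ArrayList[String] = {
    val result = obj.asInstanceOf[Result]
    val out = new util.ArrayList[String]()
    // emit one "family:qualifier:value" string per cell
    result.rawCells().foreach { cell =>
      out.add(Bytes.toString(CellUtil.cloneFamily(cell)) + ":" +
        Bytes.toString(CellUtil.cloneQualifier(cell)) + ":" +
        Bytes.toString(CellUtil.cloneValue(cell)))
    }
    out
  }
}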
Joins in Spark
Hi, I have two RDDs, vertices and edges. Vertices is a plain RDD and edges is a pair RDD. I want to take a three-way join of these two. Joins work only when both the RDDs are pair RDDs, right? So how am I supposed to take a three-way join of these RDDs? Thank You
Re: broadcasting object issue
Hi, Just ran your code on spark-shell. If you replace val bcA = sc.broadcast(a) with val bcA = sc.broadcast(new B().getA) it seems to work. Not sure why. On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote: Hi All, I have a problem with broadcasting a serialize class object that returned by another not-serialize class, here is the sample code: class A extends java.io.Serializable { def halo(): String = halo } class B { def getA() = new A } val list = List(1) val b = new B val a = b.getA val p = sc.parallelize(list) // this will fail val bcA = sc.broadcast(a) p.map(x = { bcA.value.halo() }) // this will success val bcA = sc.broadcast(new A) p.map(x = { bcA.value.halo() }) A is a serializable class, where B is not-serialize. If I create a new object A through B method getA(), the map process will failed with exception “org.apache.spark.SparkException: Task not serializable, Caused by: java.io.NotSerializableException: $iwC$$iwC$B” I don’t know why spark will check if the B class serializable or not, is there a way to code this? Best regards, Henry -- The privileged confidential information contained in this email is intended for use only by the addressees as indicated by the original sender of this email. If you are not the addressee indicated in this email or are not responsible for delivery of the email to such a person, please kindly reply to the sender indicating this fact and delete all copies of it from your computer and network server immediately. Your cooperation is highly appreciated. It is advised that any unauthorized use of confidential information of Winbond is strictly prohibited; and any information in this email irrelevant to the official business of Winbond shall be deemed as neither given nor endorsed by Winbond. -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
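A guess at what is going on, not something verified against the closure-cleaner internals: in spark-shell every line is wrapped in a synthetic $iwC object, so a closure that references a shell-level val can drag in the wrapper that also holds the non-serializable b, which would match the $iwC$$iwC$B in the exception. Two things worth trying along the lines of the reply above (the @transient variant is an assumption on my part):

// create and discard B in one expression, so no shell-level val keeps it alive
val bcA = sc.broadcast(new B().getA)
p.map(x => bcA.value.halo()).collect()

// or keep b around but mark it @transient so it is not serialized with the wrapper
@transient val b2 = new B
val bcA2 = sc.broadcast(b2.getA)
p.map(x => bcA2.value.halo()).collect()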
Re: Joins in Spark
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)). The above gives you a pair RDD. After the join, make sure you remove the superfluous null values. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have two RDDs, vertices and edges. Vertices is an RDD and edges is a pair RDD. I want to take three way join of these two. Joins work only when both the RDDs are pair RDDS right? So, how am I supposed to take a three way join of these RDDs? Thank You -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
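For concreteness, a minimal sketch of the suggestion above, assuming (only for illustration) vertices: RDD[Long] and edges: RDD[(Long, Long)]:

val pairVertices = verticesRDD.map(vertice => (vertice, null))
// edges is already a pair RDD, so it can be joined directly; drop the null afterwards
val joined = edgesRDD.join(pairVertices)
  .map { case (src, (dst, _)) => (src, dst) }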
Joins in Spark
Hi, I have two RDDs: vertices, which is a plain RDD, and edges, which is a pair RDD. I have to do a three-way join of these two. Joins work only when both the RDDs are pair RDDs, so how can we perform a three-way join of these RDDs? Thank You -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joins-in-Spark-tp20819.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Fwd: Joins in Spark
This gives me two pair RDDs: one is the edges RDD and the other is the vertices RDD with each vertex padded with a null value. But I have to take a three-way join of these two RDDs, and I have only one common attribute between them. How can I go about doing the three-way join?
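One possible reading of the three-way join, and it is only an assumption since the schemas aren't given, is vertices ⋈ edges ⋈ vertices: join the edge RDD with the padded vertex RDD on the source endpoint, re-key by the destination, and join with the vertices again:

val v = verticesRDD.map(x => (x, null))
val bySrc = edgesRDD.join(v).map { case (src, (dst, _)) => (dst, src) } // re-key by destination
val triples = bySrc.join(v).map { case (dst, (src, _)) => (src, dst) }  // edges with both endpoints matched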
Consistent hashing of RDD row
Hello, I have a process where I need to create a random number for each row in an RDD. That new RDD will be used in a few iterations, and it is necessary that between iterations the numbers don't change (i.e., if a partition gets evicted from the cache, the numbers of that partition will be regenerated identically). One way to solve it is to persist the RDD (after the random numbers are created) on disk, but it might be evicted if we run out of space on the disk, no? My idea is to do zipWithIndex on my original RDD and, for each row, create a new random generator with the index as the seed. I would like to know if zipWithIndex will assign the same index if the RDD gets evicted from the cache. For example: rdd1.join(rdd2).zipWithIndex() - if the join gets recalculated, will the rows get the same index? Or in: val rdd = hiveContext.sql(...).zipWithIndex() - if the partitions of the query get evicted and recalculated, will the index stay the same? I'd love to hear your thoughts on the matter. Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Consistent-hashing-of-RDD-row-tp20820.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
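A sketch of the idea described in the message above, seeding a generator per row with the zipWithIndex index. The caveat, which is really the open question here: the indices are only reproducible if the parent RDD is recomputed deterministically (same data, same partitioning, same order), which is an assumption rather than a guarantee for an arbitrary join or Hive query:

import scala.util.Random

val withRandom = rdd.zipWithIndex().map { case (row, idx) =>
  val rng = new Random(idx)   // the row index is the only seed
  (row, rng.nextDouble())     // regenerated identically on recomputation, provided idx comes out the same
}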
Re: Spark SQL with a sorted file
Michael, Thanks. Is this still turned off in the released 1.2? Is it possible to turn it on just to get an idea of how much of a difference it makes? -Jerry On 05/12/14 12:40 am, Michael Armbrust wrote: I'll add that some of our data formats will actually infer this sort of useful information automatically. Both Parquet and cached in-memory tables keep statistics on the min/max value for each column. When you have predicates over these sorted columns, partitions will be eliminated if they can't possibly match the predicate given the statistics. For Parquet this is new in Spark 1.2 and it is turned off by default (due to bugs we are working with the Parquet library team to fix). Hopefully soon it will be on by default. On Wed, Dec 3, 2014 at 8:44 PM, Cheng, Hao hao.ch...@intel.com wrote: You can try to write your own Relation with filter push-down, or use ParquetRelation2 as a workaround. (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala) Cheng Hao -Original Message- From: Jerry Raj [mailto:jerry@gmail.com] Sent: Thursday, December 4, 2014 11:34 AM To: user@spark.apache.org Subject: Spark SQL with a sorted file Hi, If I create a SchemaRDD from a file that I know is sorted on a certain field, is it possible to somehow pass that information on to Spark SQL so that SQL queries referencing that field are optimized? Thanks -Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
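If memory serves, the Parquet filter push-down in 1.2 sits behind a SQL conf flag; the exact name below is my recollection rather than something stated in this thread, so please double-check it against the 1.2 docs before relying on it:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true") // assumed flag name
sqlContext.parquetFile("/path/to/sorted.parquet").registerTempTable("t")
sqlContext.sql("SELECT * FROM t WHERE sortedField > 100").collect()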