Re: StreamingContext.textFileStream issue
I have no problem running the socket text stream sample in the same environment.

Thanks,
Yang

Sent from my iPhone

On Apr 25, 2015, at 1:30 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Make sure you have >= 2 cores for your streaming application.
>
> Thanks
> Best Regards
>
> On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei <genia...@gmail.com> wrote:
>
>> I hit the same issue: the stream behaves as if the directory has no files at all when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding a file into that directory. I would appreciate comments on how to resolve this.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: StreamingContext.textFileStream issue
I hit the same issue: the stream behaves as if the directory has no files at all when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding a file into that directory. I would appreciate comments on how to resolve this.
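A common cause of this symptom is how the file arrives in the watched directory: textFileStream only picks up files whose modification time falls inside the current batch window, and a file written in place can be observed before the write completes, while a file moved in from elsewhere keeps its old (and therefore ignored) modification time. The usual workaround is to write the data fresh to a hidden staging name in the same filesystem and then atomically rename it into place. A minimal sketch of that pattern (directory and file names here are placeholders, not from the original post):

```python
import os
import tempfile

def publish_atomically(watched_dir, name, data):
    """Write data to a hidden staging file, then atomically rename it into
    the directory that textFileStream is monitoring. The fresh write gives
    the file a current modification time, and the rename makes it appear
    complete in a single step."""
    os.makedirs(watched_dir, exist_ok=True)
    # Stage inside the watched directory so os.rename stays on one
    # filesystem (atomic on POSIX); the leading "." keeps Spark's file
    # stream from picking up the half-written staging file.
    fd, staging_path = tempfile.mkstemp(dir=watched_dir, prefix=".staging-")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    final_path = os.path.join(watched_dir, name)
    os.rename(staging_path, final_path)
    return final_path
```

With this pattern, dropping a file into the monitored directory while the streaming context is running should make it visible to the next batch; copying in an old file with a stale mtime will not.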
Re: Spark on Mesos
I run my Spark on Mesos by either running spark-submit in a Docker container via Marathon, or from one of the nodes in the Mesos cluster. I am on Mesos 0.21. I have tried both Spark 1.3.1 and 1.2.1, built for Hadoop 2.4 and above.

Some details on the configuration: I made sure that Spark uses IP addresses for all communication by defining spark.driver.host, SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, and SPARK_LOCAL_HOST in the right places.

Hope this helps.

Yang.

On Fri, Apr 24, 2015 at 5:15 PM, Stephen Carman <scar...@coldlight.com> wrote:

> I can't for the life of me get something even simple working for Spark on Mesos. I installed a 3-master, 3-slave Mesos cluster, which is all configured, but I can't even get the spark shell to work properly. I get errors like this:
>
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
>   Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> I tried both Mesos 0.21 and 0.22 and they both produce the same error. My version of Spark is 1.3.1 with Hadoop 2.6; I just downloaded the pre-built package from the site. Or is that wrong, and do I have to build it myself? I have my MESOS_NATIVE_JAVA_LIBRARY, Spark executor URI, and Mesos master set in my spark-env.sh, and to the best of my abilities they seem correct. Does anyone have any insight into this at all? I'm running this on Red Hat 7 with 8 CPU cores and 14 GB of RAM per slave, so 24 cores total and 42 GB of RAM total. Anyone have any idea at all what is going on here?
>
> Thanks,
> Steve
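For reference, the address-related settings mentioned above can be written out roughly as follows. This is a sketch only: the IP addresses, the Mesos master endpoint, and the executor URI path are placeholders, and SPARK_LOCAL_HOST from the reply above is kept out because I am not certain it is a standard Spark variable.

```
# spark-env.sh on the host (or Docker image) that launches the driver
export SPARK_LOCAL_IP=10.0.0.5          # placeholder: routable IP of this host
export SPARK_PUBLIC_DNS=10.0.0.5        # advertise the IP rather than a hostname
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

# spark-defaults.conf
spark.master        mesos://10.0.0.1:5050
spark.driver.host   10.0.0.5
spark.executor.uri  hdfs:///frameworks/spark-1.3.1-bin-hadoop2.4.tgz
```

The point of forcing IPs everywhere is that executors started by Mesos must be able to connect back to the driver; if the driver advertises a hostname the slaves cannot resolve, tasks are lost with exactly the ExecutorLostFailure shown above.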
Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos
I implemented two kinds of DataSource: one loads the data during buildScan, the other returns my RDD class with partition information for future loading. My RDD's compute gets the actorSystem from SparkEnv.get.actorSystem, then uses Spray to interact with an HTTP endpoint, which is the same flow as loading data in buildScan. All the Spray dependencies are included in a jar and passed to spark-submit using --jars. The job is defined in Python.

Both scenarios work when testing locally with --master local[4]. On Mesos, the non-partitioned loading works too, but the partitioned loading hits the following exception:

  Traceback (most recent call last):
    File "/root/spark-1.3.1-bin-hadoop2.4/../CloudantApp.py", line 78, in <module>
      for code in airportData.collect():
    File "/root/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/dataframe.py", line 293, in collect
      port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
    File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o60.javaToPython.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID 147, 198.11.207.72): com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'spray'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
    at spray.can.HttpExt.<init>(Http.scala:143)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
    at akka.actor.ExtensionKey.createExtension(Extension.scala:153)
    at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:711)
    at akka.actor.ExtensionId$class.apply(Extension.scala:79)
    at akka.actor.ExtensionKey.apply(Extension.scala:149)
    at akka.io.IO$.apply(IO.scala:30)
    at spray.client.pipelining$.sendReceive(pipelining.scala:35)
    at com.cloudant.spark.common.JsonStoreDataAccess.getQueryResult(JsonStoreDataAccess.scala:118)
    at com.cloudant.spark.common.JsonStoreDataAccess.getIterator(JsonStoreDataAccess.scala:71)
    at com.cloudant.spark.common.JsonStoreRDD.compute(JsonStoreRDD.scala:86)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Is this due to some kind of classpath setup issue on the executor for the external jar handling the RDD? Thanks in advance for any suggestions on how to resolve this.

Yang
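The "No configuration setting found for key 'spray'" error usually means Spray's reference.conf was not visible to the Typesafe Config loader on the executor, either because the dependency jar never reached the executor classpath, or because several jars were merged into one without concatenating their reference.conf files. Two things worth checking, sketched below with placeholder paths and jar names:

```
# Ship the dependency jar(s) to the executors; --jars takes a
# comma-separated list and distributes them to every executor.
spark-submit \
  --master mesos://10.0.0.1:5050 \
  --jars /path/to/spray-dependencies.jar \
  CloudantApp.py

# If the jar is an assembly ("fat jar"), the reference.conf files of all
# bundled libraries must be concatenated at build time; with sbt-assembly
# the relevant merge rule is:
#   case "reference.conf" => MergeStrategy.concat
```

If only the driver has the jar (e.g. it is on the local classpath but not in --jars), the non-partitioned path can still work because buildScan runs on the driver, while the partitioned path fails exactly as above because RDD.compute runs on the executors.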
Cloudant as Spark SQL External Datastore on Spark 1.3.0
Check this out: https://github.com/cloudant/spark-cloudant. It supports both the DataFrame and the SQL approach for reading data from Cloudant and saving it back. Looking forward to your feedback on the project.

Yang
Re: Question on Spark 1.3 SQL External Datasource
Thanks Cheng for the clarification. Looking forward to the new API mentioned below.

Yang

Sent from my iPad

On Mar 17, 2015, at 8:05 PM, Cheng Lian <lian.cs@gmail.com> wrote:

> Hey Yang,
>
> My comments are in-lined below.
>
> Cheng
>
> On 3/18/15 6:53 AM, Yang Lei wrote:
>
>> Hello, I am migrating my Spark SQL external datasource integration from Spark 1.2.x to Spark 1.3. I noticed there are a couple of new filters now, e.g. org.apache.spark.sql.sources.And. However, for a SQL query with condition A AND B, I noticed PrunedFilteredScan.buildScan still gets an Array[Filter] with the two filters A and B, while I had expected to get one And filter with left == A and right == B. So my first question is: where can I find the rules for converting a SQL condition into the filters passed to PrunedFilteredScan.buildScan?
>
> Top-level AND predicates are always broken into smaller sub-predicates. The And filter that appears in the external data sources API is for nested predicates, like A OR (NOT (B AND C)).
>
>> I do like what I see in these And, Or, and Not filters, where recursive nesting is allowed to connect filters together. If this is the direction we are heading in, my second question is: should buildScan take just one Filter object instead of an Array[Filter]?
>
> For data sources with further filter push-down ability (e.g. Parquet), breaking down top-level AND predicates for them can be convenient.
>
>> The third question is: what is our plan for allowing a relation provider to inform Spark which filters it has already handled, so that there is no redundant filtering?
>
> Yeah, this is a good point. I guess we can add some method like filterAccepted to PrunedFilteredScan.
>
>> I would appreciate comments and links to any existing documentation or discussion.
>>
>> Yang
Question on Spark 1.3 SQL External Datasource
Hello,

I am migrating my Spark SQL external datasource integration from Spark 1.2.x to Spark 1.3. I noticed there are a couple of new filters now, e.g. org.apache.spark.sql.sources.And. However, for a SQL query with condition A AND B, I noticed PrunedFilteredScan.buildScan still gets an Array[Filter] with the two filters A and B, while I had expected to get one And filter with left == A and right == B. So my first question is: where can I find the rules for converting a SQL condition into the filters passed to PrunedFilteredScan.buildScan?

I do like what I see in these And, Or, and Not filters, where recursive nesting is allowed to connect filters together. If this is the direction we are heading in, my second question is: should buildScan take just one Filter object instead of an Array[Filter]?

The third question is: what is our plan for allowing a relation provider to inform Spark which filters it has already handled, so that there is no redundant filtering?

I would appreciate comments and links to any existing documentation or discussion.

Yang
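[Editor's illustration] The predicate breakdown described in Cheng's reply above, where top-level ANDs are split into separate array elements while nested connectives stay intact as trees, can be shown with a small sketch. This is a hypothetical Python mirror of the Scala org.apache.spark.sql.sources filter classes, written only to show the shape of what buildScan receives; it is not Spark's actual implementation.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical stand-ins for org.apache.spark.sql.sources.{EqualTo, And, Or, Not}
@dataclass
class EqualTo:
    attribute: str
    value: object

@dataclass
class And:
    left: "Filter"
    right: "Filter"

@dataclass
class Or:
    left: "Filter"
    right: "Filter"

@dataclass
class Not:
    child: "Filter"

Filter = Union[EqualTo, And, Or, Not]

def split_conjunctive_predicates(condition):
    """Flatten top-level ANDs into a list, the way Spark does before calling
    PrunedFilteredScan.buildScan; predicates nested under OR/NOT are kept
    whole as trees."""
    if isinstance(condition, And):
        return (split_conjunctive_predicates(condition.left)
                + split_conjunctive_predicates(condition.right))
    return [condition]

# WHERE a = 1 AND (b = 2 OR c = 3)
cond = And(EqualTo("a", 1), Or(EqualTo("b", 2), EqualTo("c", 3)))
filters = split_conjunctive_predicates(cond)
# buildScan receives two entries: EqualTo("a", 1) and the intact Or(...) tree,
# which is why a query "A AND B" shows up as Array(A, B) rather than And(A, B).
```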