Re: StreamingContext.textFileStream issue
I have no problem running the socket text stream sample in the same environment.

Thanks

Yang

Sent from my iPhone

> On Apr 25, 2015, at 1:30 PM, Akhil Das wrote:
>
> Make sure you are having >= 2 cores for your streaming application.
>
> Thanks
> Best Regards
>
>> On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei wrote:
>> I hit the same issue ("as if the directory has no files at all") when
>> running the sample examples/src/main/python/streaming/hdfs_wordcount.py
>> with a local directory and adding a file into that directory. Appreciate
>> comments on how to resolve this.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
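[Editor's note, for readers hitting the same symptom: besides the core count, textFileStream only picks up files that appear in the monitored directory after the streaming context has started, and the Spark Streaming guide recommends delivering them by an atomic move or rename so a half-written file is never observed. A minimal shell sketch of that delivery pattern; both directory paths are placeholders:]

```shell
# Stage the file outside the watched directory, then move it in atomically.
# WATCH_DIR is the directory passed to textFileStream; both paths are
# placeholders and must be on the same filesystem for mv to be atomic.
WATCH_DIR=/tmp/stream_in
STAGE_DIR=/tmp/stream_stage
mkdir -p "$WATCH_DIR" "$STAGE_DIR"

# Write the complete file in the staging area first...
echo "hello streaming world" > "$STAGE_DIR/batch1.txt"

# ...then rename it into place, so it appears fully formed with a fresh
# modification time that falls inside the current batch window.
mv "$STAGE_DIR/batch1.txt" "$WATCH_DIR/batch1.txt"
```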
Re: Spark on Mesos
I run my Spark on Mesos by either running spark-submit in a Docker container
via Marathon or from one of the nodes in the Mesos cluster. I am on Mesos
0.21. I have tried both Spark 1.3.1 and 1.2.1 built against Hadoop 2.4 and
above.

Some details on the configuration: I made sure that Spark uses IP addresses
for all communication by defining spark.driver.host, SPARK_PUBLIC_DNS,
SPARK_LOCAL_IP and SPARK_LOCAL_HOST in the right places.

Hope this helps.

Yang.

On Fri, Apr 24, 2015 at 5:15 PM, Stephen Carman wrote:
> So I can't for the life of me get something even simple working for
> Spark on Mesos.
>
> I installed a 3-master, 3-slave Mesos cluster, which is all configured,
> but I can't for the life of me even get the spark-shell to work properly.
>
> I get errors like this:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 5
> in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor
> 20150424-104711-1375862026-5050-20113-S1 lost)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> I tried both Mesos 0.21 and 0.22 and they both produce the same error.
>
> My version of Spark is 1.3.1 with Hadoop 2.6. I just downloaded the
> pre-built package from the site, or is that wrong and I have to build it
> myself?
>
> I have MESOS_NATIVE_JAVA_LIBRARY, the Spark executor URI and the Mesos
> master set in my spark-env.sh; to the best of my abilities they seem
> correct.
>
> Does anyone have any insight into this at all? I'm running this on Red
> Hat 7 with 8 CPU cores and 14 GB of RAM per slave, so 24 cores total and
> 42 GB of RAM total.
>
> Anyone have any idea at all what is going on here?
>
> Thanks,
> Steve
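[Editor's note: a spark-env.sh along the lines Yang describes might look like the sketch below. Every value is a placeholder for your own cluster, not a working configuration; SPARK_EXECUTOR_URI can equivalently be set as spark.executor.uri in spark-defaults.conf.]

```shell
# spark-env.sh sketch for a Spark-on-Mesos deployment (placeholder values).

# Native Mesos library and the Spark distribution the slaves download.
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://namenode:9000/dist/spark-1.3.1-bin-hadoop2.6.tgz

# Force IP-based addressing so executors can reach the driver even when
# hostnames do not resolve consistently across the cluster.
export SPARK_LOCAL_IP=10.0.0.5
export SPARK_PUBLIC_DNS=10.0.0.5
```

The driver-side counterpart can be passed at launch, e.g. `spark-shell --master mesos://10.0.0.1:5050 --conf spark.driver.host=10.0.0.5`, again with placeholder addresses.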
Re: StreamingContext.textFileStream issue
I hit the same issue ("as if the directory has no files at all") when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding a file into that directory. Appreciate comments on how to resolve this.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos
I implemented two kinds of DataSource: one loads data during buildScan, the other returns my RDD class with partition information for future loading. My RDD's compute gets the actorSystem from SparkEnv.get.actorSystem, then uses Spray to interact with an HTTP endpoint, which is the same flow as loading data in buildScan. All the Spray dependencies are included in a jar and passed to spark-submit using --jars. The job is defined in Python.

Both scenarios work when testing locally using --master local[4]. On Mesos, the non-partitioned loading works too, but the partitioned loading hits the following exception:

Traceback (most recent call last):
  File "/root/spark-1.3.1-bin-hadoop2.4/../CloudantApp.py", line 78, in
    for code in airportData.collect():
  File "/root/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/dataframe.py", line 293, in collect
    port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.javaToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID 147, 198.11.207.72): com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'spray'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
    at spray.can.HttpExt.<init>(Http.scala:143)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
    at akka.actor.ExtensionKey.createExtension(Extension.scala:153)
    at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:711)
    at akka.actor.ExtensionId$class.apply(Extension.scala:79)
    at akka.actor.ExtensionKey.apply(Extension.scala:149)
    at akka.io.IO$.apply(IO.scala:30)
    at spray.client.pipelining$.sendReceive(pipelining.scala:35)
    at com.cloudant.spark.common.JsonStoreDataAccess.getQueryResult(JsonStoreDataAccess.scala:118)
    at com.cloudant.spark.common.JsonStoreDataAccess.getIterator(JsonStoreDataAccess.scala:71)
    at com.cloudant.spark.common.JsonStoreRDD.compute(JsonStoreRDD.scala:86)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Is this due to some kind of classpath setup issue on the executors for the external jar handling the RDD? Thanks in advance for any suggestions on how to resolve this.

Yang
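[Editor's note: this Typesafe Config error usually means Spray's reference.conf was not visible on the executor classpath when the spray-can extension was instantiated there; on the driver the issue is often masked. If the --jars artifact is a fat jar, one common culprit is the assembly step keeping only one of the several reference.conf files the dependencies ship. A build.sbt sketch of the usual fix, assuming sbt-assembly is the build tool in use:]

```scala
// build.sbt fragment (sketch, assumes the sbt-assembly plugin):
// concatenate every library's reference.conf instead of picking one,
// so Spray's default configuration survives the fat-jar merge.
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case other =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(other)
}
```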
Cloudant as Spark SQL External Datastore on Spark 1.3.0
Check this out: https://github.com/cloudant/spark-cloudant. It supports both the DataFrame and the SQL approach for reading data from Cloudant and saving it back. Looking forward to your feedback on the project.

Yang
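[Editor's note: as a rough illustration only, a PySpark job against the connector is typically launched with the connector jar on the classpath; the jar path and script name below are placeholders, and the connection options (host, credentials, database) are supplied inside the job itself as described in the spark-cloudant README.]

```shell
# Sketch only: paths and names are placeholders, not real artifacts.
spark-submit \
  --master local[4] \
  --jars /path/to/cloudant-spark.jar \
  CloudantApp.py
```

Inside the job, the data source is addressed by its fully-qualified provider name (in the USING clause of a CREATE TEMPORARY TABLE statement, or in the DataFrame load call); check the project README for the exact identifier and option keys.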
Re: Question on Spark 1.3 SQL External Datasource
Thanks Cheng for the clarification. Looking forward to the new API mentioned below.

Yang

Sent from my iPad

> On Mar 17, 2015, at 8:05 PM, Cheng Lian wrote:
>
> Hey Yang,
>
> My comments are in-lined below.
>
> Cheng
>
>> On 3/18/15 6:53 AM, Yang Lei wrote:
>> Hello,
>>
>> I am migrating my Spark SQL external datasource integration from Spark
>> 1.2.x to Spark 1.3.
>>
>> I noticed there are a couple of new filters now, e.g.
>> org.apache.spark.sql.sources.And. However, for a SQL query with the
>> condition "A AND B", PrunedFilteredScan.buildScan still gets an
>> Array[Filter] with the two filters A and B, while I had expected one
>> "And" filter with left == A and right == B.
>>
>> So my first question is: where can I find the rules for converting a
>> SQL condition into the filters passed to PrunedFilteredScan.buildScan?
>
> Top-level AND predicates are always broken into smaller sub-predicates.
> The And filter that appears in the external data sources API is for
> nested predicates, like A OR (NOT (B AND C)).
>
>> I do like what I see in these And, Or and Not filters, where recursive
>> nesting is allowed to connect filters together. If this is the
>> direction we are heading in, my second question is: should we just pass
>> one Filter object instead of an Array[Filter] to buildScan?
>
> For data sources with further filter push-down ability (e.g. Parquet),
> breaking down top-level AND predicates can be convenient.
>
>> The third question is: what is our plan for allowing a relation
>> provider to inform Spark which filters are handled already, so that
>> there is no redundant filtering?
>
> Yeah, this is a good point. I guess we can add some method like
> "filterAccepted" to PrunedFilteredScan.
>
>> Appreciate comments and links to any existing documentation or
>> discussion.
>>
>> Yang
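[Editor's note: the rule Cheng describes (top-level conjuncts arrive as separate array elements, and And only ever appears nested under Or/Not) is easy to model outside Spark. A small illustrative Python sketch follows; the class names mirror org.apache.spark.sql.sources for readability, but this is a model of the behavior, not Spark code.]

```python
# Illustrative model of Spark SQL's data source filters (not Spark code).
from dataclasses import dataclass
from typing import Union

@dataclass
class EqualTo:
    attribute: str
    value: object

@dataclass
class And:
    left: "Filter"
    right: "Filter"

@dataclass
class Or:
    left: "Filter"
    right: "Filter"

@dataclass
class Not:
    child: "Filter"

Filter = Union[EqualTo, And, Or, Not]

def split_conjuncts(f: Filter) -> list:
    """Break top-level ANDs into a flat list, the way buildScan
    receives them; any other filter is passed through as one element."""
    if isinstance(f, And):
        return split_conjuncts(f.left) + split_conjuncts(f.right)
    return [f]

a, b, c = EqualTo("a", 1), EqualTo("b", 2), EqualTo("c", 3)

# "A AND B" arrives as two separate filters, not one And node...
print(split_conjuncts(And(a, b)))

# ...while And survives only when nested, e.g. A OR (NOT (B AND C)),
# which arrives as a single Or filter.
print(split_conjuncts(Or(a, Not(And(b, c)))))
```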
Question on Spark 1.3 SQL External Datasource
Hello,

I am migrating my Spark SQL external datasource integration from Spark 1.2.x to Spark 1.3.

I noticed there are a couple of new filters now, e.g. org.apache.spark.sql.sources.And. However, for a SQL query with the condition "A AND B", PrunedFilteredScan.buildScan still gets an Array[Filter] with the two filters A and B, while I had expected one "And" filter with left == A and right == B.

So my first question is: where can I find the rules for converting a SQL condition into the filters passed to PrunedFilteredScan.buildScan?

I do like what I see in these And, Or and Not filters, where recursive nesting is allowed to connect filters together. If this is the direction we are heading in, my second question is: should we just pass one Filter object instead of an Array[Filter] to buildScan?

The third question is: what is our plan for allowing a relation provider to inform Spark which filters are handled already, so that there is no redundant filtering?

Appreciate comments and links to any existing documentation or discussion.

Yang