Re: StreamingContext.textFileStream issue

2015-04-25 Thread Yang Lei
I have no problem running the socket text stream sample in the same 
environment. 

Thanks

Yang

Sent from my iPhone

> On Apr 25, 2015, at 1:30 PM, Akhil Das  wrote:
> 
> Make sure you have >= 2 cores for your streaming application.
> 
> Thanks
> Best Regards
> 
>> On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei  wrote:
>> I hit the same issue ("as if the directory has no files at all") when running
>> the sample "examples/src/main/python/streaming/hdfs_wordcount.py" with a
>> local directory and adding a file into that directory. I'd appreciate comments
>> on how to resolve this.
>> 
> 


Re: Spark on Mesos

2015-04-24 Thread Yang Lei
I run Spark over Mesos by running spark-submit either in a Docker container
using Marathon or from one of the nodes in the Mesos cluster. I am on Mesos
0.21. I have tried both Spark 1.3.1 and 1.2.1, built against Hadoop 2.4 and
above.

Some details on the configuration: I made sure that Spark uses IP addresses for
all communication by defining spark.driver.host, SPARK_PUBLIC_DNS,
SPARK_LOCAL_IP, and SPARK_LOCAL_HOST in the right places.
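
For example, a minimal sketch of those settings from the driver side (all
addresses below are made up for illustration); SPARK_PUBLIC_DNS and
SPARK_LOCAL_IP go into spark-env.sh on the matching hosts:

import org.apache.spark.{SparkConf, SparkContext}

object MesosSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("mesos-smoke-test")
      .setMaster("mesos://10.0.0.10:5050")                       // Mesos master, by IP
      .set("spark.driver.host", "10.0.0.5")                      // IP the executors use to reach the driver
      .set("spark.executor.uri",
           "http://10.0.0.5/dist/spark-1.3.1-bin-hadoop2.4.tgz") // prebuilt Spark the slaves fetch
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())                   // trivial job to confirm executors register
    sc.stop()
  }
}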

Hope this helps.

Yang.

On Fri, Apr 24, 2015 at 5:15 PM, Stephen Carman  wrote:

> So I can’t for the life of me get something even simple working for
> Spark on Mesos.
>
> I installed a 3-master, 3-slave Mesos cluster, which is all configured,
> but I can’t for the life of me even get the Spark shell to work properly.
>
> I get errors like this:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> I tried both Mesos 0.21 and 0.22 and they both produce the same error…
>
> My version of Spark is 1.3.1 with Hadoop 2.6. I just downloaded the
> pre-built package from the site, or is that wrong and I have to build it myself?
>
> I have MESOS_NATIVE_JAVA_LIBRARY, the Spark executor URI, and the Mesos master
> set in my spark-env.sh; to the best of my ability they seem correct.
>
> Does anyone have any insight into this at all? I’m running this on Red Hat 7
> with 8 CPU cores and 14 GB of RAM per slave, so 24 cores and 42 GB of RAM in
> total.
>
> Anyone have any idea at all what is going on here?
>
> Thanks,
> Steve
>


Re: StreamingContext.textFileStream issue

2015-04-24 Thread Yang Lei
I hit the same issue ("as if the directory has no files at all") when running
the sample "examples/src/main/python/streaming/hdfs_wordcount.py" with a
local directory and adding a file into that directory. I'd appreciate comments
on how to resolve this.
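
For reference, a minimal Scala equivalent of what I am running (the input path
below is just a placeholder). Per the Spark Streaming documentation,
textFileStream only picks up files that are created in the monitored directory
after the context starts, so files should be moved in atomically rather than
appended in place:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object LocalDirWordCount {
  def main(args: Array[String]): Unit = {
    // Receiver-based streams (e.g. socketTextStream) need at least two local
    // threads; textFileStream has no receiver, but local[2] is still a safe choice.
    val conf = new SparkConf().setAppName("LocalDirWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Only files created in this directory after start() are processed.
    val lines = ssc.textFileStream("file:///tmp/streaming-input")
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}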






Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-20 Thread Yang Lei
I implemented two kinds of DataSource: one loads data during buildScan, the
other returns my RDD class with partition information for future loading.

My RDD's compute gets the actor system from SparkEnv.get.actorSystem, then uses
Spray to interact with an HTTP endpoint, which is the same flow as loading
data in buildScan. All the Spray dependencies are included in a jar that is
passed to spark-submit using --jars.
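
To make the pattern concrete, here is a stripped-down sketch of the partitioned
RDD (the class names and endpoint handling are simplified for illustration, not
my actual code):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

import akka.actor.ActorSystem
import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
import org.apache.spark.rdd.RDD
import spray.client.pipelining._
import spray.http.{HttpRequest, HttpResponse}

// Each partition knows which URL slice it is responsible for loading.
case class HttpPartition(index: Int, url: String) extends Partition

class HttpJsonRDD(sc: SparkContext, urls: Seq[String]) extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    urls.zipWithIndex.map { case (u, i) => HttpPartition(i, u) }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val part = split.asInstanceOf[HttpPartition]
    // Reuse the executor-side actor system instead of creating one per task.
    implicit val system: ActorSystem = SparkEnv.get.actorSystem
    import system.dispatcher
    // sendReceive instantiates spray.can.HttpExt, which reads the "spray"
    // section of the Typesafe config (the lookup that fails in the trace below).
    val pipeline: HttpRequest => Future[HttpResponse] = sendReceive
    val response = Await.result(pipeline(Get(part.url)), 5.minutes)
    Iterator(response.entity.asString)
  }
}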

The job is defined in Python.

Both scenarios work when testing locally using --master local[4]. On Mesos, the
non-partitioned loading works too, but the partitioned loading hits the
following exception.

Traceback (most recent call last):
  File "/root/spark-1.3.1-bin-hadoop2.4/../CloudantApp.py", line 78, in <module>
    for code in airportData.collect():
  File "/root/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/dataframe.py", line 293, in collect
    port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.javaToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID 147, 198.11.207.72): com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'spray'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
    at spray.can.HttpExt.<init>(Http.scala:143)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
    at akka.actor.ExtensionKey.createExtension(Extension.scala:153)
    at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:711)
    at akka.actor.ExtensionId$class.apply(Extension.scala:79)
    at akka.actor.ExtensionKey.apply(Extension.scala:149)
    at akka.io.IO$.apply(IO.scala:30)
    at spray.client.pipelining$.sendReceive(pipelining.scala:35)
    at com.cloudant.spark.common.JsonStoreDataAccess.getQueryResult(JsonStoreDataAccess.scala:118)
    at com.cloudant.spark.common.JsonStoreDataAccess.getIterator(JsonStoreDataAccess.scala:71)
    at com.cloudant.spark.common.JsonStoreRDD.compute(JsonStoreRDD.scala:86)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Is this due to some kind of classpath setup issue on the executor for the
external jar used by the RDD?

Thanks in advance for any suggestions on how to resolve this.

Yang


Cloudant as Spark SQL External Datastore on Spark 1.3.0

2015-03-19 Thread Yang Lei
Check this out: https://github.com/cloudant/spark-cloudant. It supports both
the DataFrame and SQL approaches for reading data from Cloudant and for saving
data back.
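
As a rough sketch of how it plugs into the Spark 1.3.0 external datasource API
(the option keys and database names below are placeholders, not taken from the
project's documentation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object CloudantExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cloudant-example"))
    val sqlContext = new SQLContext(sc)

    // DataFrame approach: generic Spark 1.3 load path with a datasource name.
    val df = sqlContext.load("com.cloudant.spark",
      Map("cloudant.host" -> "account.cloudant.com",
          "cloudant.username" -> "user",
          "cloudant.password" -> "pass",
          "database" -> "airportcodemapping"))

    // SQL approach: register the DataFrame and query it.
    df.registerTempTable("airports")
    sqlContext.sql("SELECT * FROM airports LIMIT 10").show()

    // Save path back through the same datasource.
    df.save("com.cloudant.spark", SaveMode.Append, Map("database" -> "airports_copy"))
    sc.stop()
  }
}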

Looking forward to your feedback on the project.

Yang


Re: Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei

Thanks Cheng for the clarification. 

Looking forward to this new API mentioned below. 

Yang

Sent from my iPad

> On Mar 17, 2015, at 8:05 PM, Cheng Lian  wrote:
> 
> Hey Yang,
> 
> My comments are in-lined below.
> 
> Cheng
> 
>> On 3/18/15 6:53 AM, Yang Lei wrote:
>> Hello, 
>> 
>> I am migrating my Spark SQL external datasource integration from Spark 1.2.x 
>> to Spark 1.3. 
>> 
>> I noticed there are a couple of new filters now, e.g.
>> org.apache.spark.sql.sources.And. However, for a SQL query with the condition
>> "A AND B", I noticed PrunedFilteredScan.buildScan still gets an Array[Filter]
>> with two filters, A and B, while I had expected to get one "And" filter with
>> left == A and right == B.
>> 
>> So my first question is: where can I find the "rules" for converting a SQL
>> condition into the filters passed to PrunedFilteredScan.buildScan?
> Top-level AND predicates are always broken into smaller sub-predicates. The
> And filter that appears in the external data sources API is for nested
> predicates, like A OR (NOT (B AND C)).
>> 
>> I do like what I see with these And, Or, Not filters, where we allow recursive
>> nested definitions to connect filters together. If this is the direction we
>> are heading in, my second question is: do we just need one Filter object
>> instead of an Array[Filter] in buildScan?
> For data sources with further filter push-down ability (e.g. Parquet),
> breaking down top-level AND predicates can be convenient.
>> 
>> The third question is: what is our plan for allowing a relation provider to
>> inform Spark which filters have already been handled, so that there is no
>> redundant filtering?
> Yeah, this is a good point. I guess we can add some method like
> "filterAccepted" to PrunedFilteredScan.
>> 
>> I'd appreciate comments and links to any existing documentation or discussion.
>> 
>> 
>> Yang
> 


Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei
Hello,

I am migrating my Spark SQL external datasource integration from Spark
1.2.x to Spark 1.3.

I noticed there are a couple of new filters now, e.g.
org.apache.spark.sql.sources.And. However, for a SQL query with the condition
"A AND B", I noticed PrunedFilteredScan.buildScan still gets an Array[Filter]
with two filters, A and B, while I had expected to get one "And" filter with
left == A and right == B.

So my first question is: where can I find the "rules" for converting a SQL
condition into the filters passed to PrunedFilteredScan.buildScan?
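
For concreteness, a small sketch (with a made-up relation and schema) of what I
observe arriving in buildScan:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class PushdownExample(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // For "WHERE a = 1 AND b > 2" this receives Array(EqualTo("a", 1), GreaterThan("b", 2)):
    // the top-level AND is split apart. A nested predicate such as "a = 1 OR NOT (b > 2)"
    // arrives instead as one Or(EqualTo("a", 1), Not(GreaterThan("b", 2))) tree.
    filters.foreach {
      case EqualTo(attr, value)     => println(s"could push down: $attr = $value")
      case GreaterThan(attr, value) => println(s"could push down: $attr > $value")
      case Or(left, right)          => println(s"nested predicate: ($left) OR ($right)")
      case other                    => println(s"not pushed down: $other")
    }
    // The filters are advisory: Spark re-applies the predicates to the returned
    // rows, so an unfiltered scan here is still correct, just less efficient.
    sqlContext.sparkContext.parallelize(Seq(Row(1, 3), Row(2, 1)))
  }
}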

I do like what I see with these And, Or, Not filters, where we allow recursive
nested definitions to connect filters together. If this is the direction we
are heading in, my second question is: do we just need one Filter object
instead of an Array[Filter] in buildScan?

The third question is: what is our plan for allowing a relation provider to
inform Spark which filters have already been handled, so that there is no
redundant filtering?
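
Purely as a hypothetical sketch of what such a hook could look like (this is
not an existing Spark 1.3 API, just an illustration of the request):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical trait: the relation reports which of the pushed filters it fully
// handles, so Spark could skip re-evaluating those predicates.
trait FilterReportingScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
  def acceptedFilters(filters: Array[Filter]): Array[Filter]  // subset the data source handles itself
}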

I'd appreciate comments and links to any existing documentation or discussion.


Yang