Re: StreamingContext.textFileStream issue

2015-04-25 Thread Yang Lei
I have no problem running the socket text stream sample in the same 
environment. 

Thanks

Yang

Sent from my iPhone

 On Apr 25, 2015, at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 Make sure you have >= 2 cores for your streaming application.
 
 Thanks
 Best Regards
 
 On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote:
 I hit the same issue: it behaves as if the directory has no files at all when running
 the sample examples/src/main/python/streaming/hdfs_wordcount.py against a
 local directory and adding a file into that directory. I would appreciate comments
 on how to resolve this.
 
 
 
 
 


Re: StreamingContext.textFileStream issue

2015-04-24 Thread Yang Lei
I hit the same issue: it behaves as if the directory has no files at all when running
the sample examples/src/main/python/streaming/hdfs_wordcount.py against a
local directory and adding a file into that directory. I would appreciate comments
on how to resolve this.
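
For reference, a minimal Scala sketch of the same word count using StreamingContext.textFileStream (the monitored directory is a placeholder). One thing worth checking: textFileStream only picks up files that appear in the monitored directory after the streaming context has started, and they should be moved or renamed into the directory atomically rather than written in place, which is a common reason the stream appears empty.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Count words in new text files dropped into the monitored directory.
    val lines = ssc.textFileStream("/tmp/streaming-input") // placeholder path
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}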






Re: Spark on Mesos

2015-04-24 Thread Yang Lei
I run Spark on Mesos by either running spark-submit in a Docker
container via Marathon or from one of the nodes in the Mesos cluster. I am on
Mesos 0.21. I have tried both Spark 1.3.1 and 1.2.1, built for Hadoop
2.4 and above.

Some details on the configuration: I made sure that Spark uses IP
addresses for all communication by
defining spark.driver.host, SPARK_PUBLIC_DNS, SPARK_LOCAL_IP and SPARK_LOCAL_HOST
in the right places.
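
A minimal sketch of what that can look like (the IP addresses and the application name below are placeholders):

# spark-env.sh on the node (or Docker image) that runs the driver
export SPARK_LOCAL_IP=10.0.0.5
export SPARK_PUBLIC_DNS=10.0.0.5

# and when submitting, pin the driver host to the same IP
spark-submit \
  --master mesos://10.0.0.1:5050 \
  --conf spark.driver.host=10.0.0.5 \
  my_app.py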

Hope this helps.

Yang.

On Fri, Apr 24, 2015 at 5:15 PM, Stephen Carman scar...@coldlight.com
wrote:

 So I can’t for the life of me get even something simple working for
 Spark on Mesos.

 I installed a 3-master, 3-slave Mesos cluster, which is all configured,
 but I can’t even get the Spark shell to work properly.

 I get errors like this:

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

 I tried both Mesos 0.21 and 0.22 and they both produce the same error…

 My version of Spark is 1.3.1 with Hadoop 2.6; I just downloaded the
 pre-built package from the site. Or is that wrong and I have to build it myself?

 I have MESOS_NATIVE_JAVA_LIBRARY, the Spark executor URI and the Mesos master
 set in my spark-env.sh; to the best of my abilities they seem correct.
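
 For reference, a rough sketch of what those spark-env.sh settings look like per the Spark-on-Mesos documentation (the library path, executor URI and master address are placeholders):

 # spark-env.sh
 export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=hdfs://namenode/path/to/spark-1.3.1-bin-hadoop2.6.tgz

 # then start the shell against the Mesos master
 ./bin/spark-shell --master mesos://10.0.0.1:5050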

 Does anyone have any insight into this at all? I’m running this on Red Hat
 7 with 8 CPU cores and 14 GB of RAM per slave, so 24 cores and 42 GB of
 RAM in total.

 Anyone have any idea at all what is going on here?

 Thanks,
 Steve





Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-20 Thread Yang Lei
I implemented two kinds of DataSource: one loads data during buildScan, the
other returns my own RDD class with partition information for later loading.

My RDD's compute gets the actorSystem from SparkEnv.get.actorSystem, then uses
Spray to interact with an HTTP endpoint, which is the same flow as loading
data in buildScan. All the Spray dependencies are included in a jar and
passed to spark-submit using --jars.
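
For context, a minimal sketch of the shape of such a partitioned RDD (this is not the actual connector code; the class name is made up and plain scala.io.Source stands in for the Spray-based HTTP access):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.io.Source

// One partition per remote URL to fetch.
case class HttpPartition(index: Int, url: String) extends Partition

class HttpJsonRDD(sc: SparkContext, urls: Seq[String]) extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    urls.zipWithIndex.map { case (u, i) => HttpPartition(i, u) }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[HttpPartition]
    // The real code goes through Spray and SparkEnv.get.actorSystem here;
    // scala.io.Source keeps the sketch self-contained.
    Source.fromURL(p.url).getLines()
  }
}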

The job is defined in Python.

Both scenarios work when testing locally with --master local[4]. On Mesos, the
non-partitioned loading works too, but the partitioned loading hits the
following exception:

Traceback (most recent call last):
  File "/root/spark-1.3.1-bin-hadoop2.4/../CloudantApp.py", line 78, in <module>
    for code in airportData.collect():
  File "/root/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/dataframe.py", line 293, in collect
    port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.javaToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID 147, 198.11.207.72): com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'spray'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
at spray.can.HttpExt.<init>(Http.scala:143)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at akka.actor.ExtensionKey.createExtension(Extension.scala:153)
at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:711)
at akka.actor.ExtensionId$class.apply(Extension.scala:79)
at akka.actor.ExtensionKey.apply(Extension.scala:149)
at akka.io.IO$.apply(IO.scala:30)
at spray.client.pipelining$.sendReceive(pipelining.scala:35)
at com.cloudant.spark.common.JsonStoreDataAccess.getQueryResult(JsonStoreDataAccess.scala:118)
at com.cloudant.spark.common.JsonStoreDataAccess.getIterator(JsonStoreDataAccess.scala:71)
at com.cloudant.spark.common.JsonStoreRDD.compute(JsonStoreRDD.scala:86)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Is this due to some kind of classpath setup issue on the executors for the
external jar that provides the RDD?
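
In case it is a classpath problem, a hedged sketch of one way to make the external jar visible on the executors as well as the driver (the paths, master address and jar name are placeholders):

spark-submit \
  --master mesos://10.0.0.1:5050 \
  --jars /path/to/datasource-assembly.jar \
  --conf spark.executor.extraClassPath=/path/to/datasource-assembly.jar \
  CloudantApp.py

--jars ships the jar with the job and adds it to the executors' classloader, while spark.executor.extraClassPath prepends a copy that must already exist at that path on every executor node to the executor classpath.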

Thanks in advance for any suggestions on how to resolve this.

Yang


Cloudant as Spark SQL External Datastore on Spark 1.3.0

2015-03-19 Thread Yang Lei
Check this out: https://github.com/cloudant/spark-cloudant. It supports
both the DataFrame and SQL approaches for reading data from Cloudant and
saving it.
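
A rough usage sketch against Spark 1.3.0 (the data source name and option keys below are illustrative; please check the project README for the exact ones):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("CloudantDemo"))
val sqlContext = new SQLContext(sc)

// DataFrame approach
val df = sqlContext.load("com.cloudant.spark", Map(
  "cloudant.host" -> "account.cloudant.com",
  "cloudant.username" -> "user",
  "cloudant.password" -> "pass",
  "database" -> "n_airportcodemapping"))
df.show()

// SQL approach
sqlContext.sql(
  """CREATE TEMPORARY TABLE airportTable
    |USING com.cloudant.spark
    |OPTIONS (database 'n_airportcodemapping')""".stripMargin)
sqlContext.sql("SELECT * FROM airportTable").show()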

Looking forward to your feedback on the project.

Yang


Re: Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei

Thanks Cheng for the clarification. 

Looking forward to this new API mentioned below. 

Yang

Sent from my iPad

 On Mar 17, 2015, at 8:05 PM, Cheng Lian lian.cs@gmail.com wrote:
 
 Hey Yang,
 
 My comments are in-lined below.
 
 Cheng
 
 On 3/18/15 6:53 AM, Yang Lei wrote:
 Hello, 
 
 I am migrating my Spark SQL external datasource integration from Spark 1.2.x 
 to Spark 1.3. 
 
 I noticed there are a couple of new filters now, e.g.
 org.apache.spark.sql.sources.And. However, for a SQL query with the condition A AND
 B, I noticed PrunedFilteredScan.buildScan still gets an Array[Filter] with
 two filters, A and B, while I had expected to get one And filter with
 left == A and right == B.
 
 So my first question is: where can I find the rules for converting a
 SQL condition into the filters passed to PrunedFilteredScan.buildScan?
 Top-level AND predicates are always broken into smaller sub-predicates. The
 And filter that appears in the external data sources API is for nested
 predicates, like A OR (NOT (B AND C)).
 
 I do like what I see with these And, Or, Not filters, where we allow recursive
 nested definitions to connect filters together. If this is the direction we
 are heading in, my second question is: do we just need one Filter object
 instead of an Array[Filter] in buildScan?
 For data sources with further filter push-down ability (e.g. Parquet),
 breaking down top-level AND predicates can be convenient.
 
 The third question is: what is our plan to allow a relation provider to
 inform Spark which filters are already handled, so that there is no
 redundant filtering?
 Yeah, this is a good point. I guess we can add some method like
 filterAccepted to PrunedFilteredScan.
 
 Appreciate comments and links to any existing documentation or discussion.
 
 
 Yang
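
To make the conversion concrete, a minimal sketch of a PrunedFilteredScan relation against the Spark 1.3 org.apache.spark.sql.sources API (the relation and its filter translation are illustrative only). For WHERE a = 1 AND b = 2, buildScan receives Array(EqualTo("a", 1), EqualTo("b", 2)); a nested predicate such as a = 1 OR NOT (b = 2 AND c = 3) arrives as a single Or(EqualTo(...), Not(And(...))) tree.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class DemoRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))

  // Translate the filters we understand into a predicate string for the external
  // store; anything we cannot handle is ignored (Spark re-applies all filters anyway).
  private def toPredicate(f: Filter): Option[String] = f match {
    case EqualTo(attr, value)     => Some(s"$attr = $value")
    case GreaterThan(attr, value) => Some(s"$attr > $value")
    case And(left, right)         => for (l <- toPredicate(left); r <- toPredicate(right)) yield s"($l AND $r)"
    case Or(left, right)          => for (l <- toPredicate(left); r <- toPredicate(right)) yield s"($l OR $r)"
    case Not(child)               => toPredicate(child).map(p => s"(NOT $p)")
    case _                        => None
  }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val pushed = filters.flatMap(toPredicate).mkString(" AND ")
    // A real implementation would send `pushed` to the external store along with
    // the requested columns; an empty RDD keeps the sketch minimal.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}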
 


Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Yang Lei
Hello,

I am migrating my Spark SQL external datasource integration from Spark
1.2.x to Spark 1.3.

I noticed there are a couple of new filters now, e.g.
org.apache.spark.sql.sources.And.
However, for a SQL query with the condition A AND B, I noticed
PrunedFilteredScan.buildScan
still gets an Array[Filter] with two filters, A and B, while I
had expected to get one And filter with left == A and right == B.

So my first question is: where can I find the rules for converting a
SQL condition into the filters passed to PrunedFilteredScan.buildScan?

I do like what I see with these And, Or, Not filters, where we allow recursive
nested definitions to connect filters together. If this is the direction we
are heading in, my second question is: do we just need one Filter object
instead of an Array[Filter] in buildScan?

The third question is: what is our plan to allow a relation provider to
inform Spark which filters are already handled, so that there is
no redundant filtering?

Appreciate comments and links to any existing documentation or discussion.


Yang