2 spark streaming questions

2014-11-23 Thread tian zhang

Hi, Dear Spark Streaming Developers and Users,
We are prototyping with Spark Streaming and hit the following two issues, on
which I would like to seek your expertise.
1) We have a Spark Streaming application in Scala that reads data from Kafka
into a DStream, does some processing, and outputs a transformed DStream. If
for some reason the Kafka connection is unavailable or times out, the Spark
Streaming job starts to emit empty RDDs from then on. The log is clean,
without any ERROR indicator. I googled around and this seems to be a known
issue. We believe the Spark Streaming infrastructure should either retry or
return an error/exception. Can you share how you handle this case? (A sketch
of the guard we are considering appears below, after question 2.)
2) We would like to implement a Spark Streaming job that joins a 1-minute
duration DStream of real-time events with a metadata RDD read from a
database. The metadata changes only slightly each day in the database. So
what is the best practice for refreshing that RDD daily while keeping the
streaming join job running? Is this doable as of Spark 1.1.0? (A sketch of
the pattern we are experimenting with follows as well.)
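For question 1, the guard we are considering looks like this (a minimal
sketch, not a fix: ssc is our StreamingContext, kafkaStream is our Kafka
input DStream, and the threshold of 10 empty batches is arbitrary). It shuts
the job down after a run of silently empty batches so that an external
supervisor can restart it with a fresh Kafka connection:

import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

val emptyBatches = new AtomicInteger(0)
val kafkaLooksDead = new AtomicBoolean(false)

kafkaStream.foreachRDD { rdd =>
  if (rdd.take(1).isEmpty) {    // cheap emptiness test; RDD.isEmpty arrived after 1.1.0
    if (emptyBatches.incrementAndGet() >= 10) kafkaLooksDead.set(true)
  } else {
    emptyBatches.set(0)         // data flowed again, reset the counter
  }
}

ssc.start()
while (!kafkaLooksDead.get) Thread.sleep(5000)  // poll instead of ssc.awaitTermination()
ssc.stop(stopSparkContext = true)               // exit; the supervisor restarts the job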
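For question 2, the pattern we are experimenting with relies on transform,
whose function runs on the driver once per batch, so the metadata reference
can be swapped without stopping the job (a sketch: loadMetadata() stands in
for our database read and must return a pair RDD keyed the same way as the
events in the events DStream):

var metadata = loadMetadata().cache()
var lastRefresh = System.currentTimeMillis()
val refreshMs = 24L * 60 * 60 * 1000    // reload once a day

val joined = events.transform { rdd =>
  if (System.currentTimeMillis() - lastRefresh > refreshMs) {
    metadata.unpersist()                // drop the stale copy from the cache
    metadata = loadMetadata().cache()   // re-read the slightly changed table
    lastRefresh = System.currentTimeMillis()
  }
  rdd.join(metadata)                    // per-batch join against current metadata
}

The join keeps running continuously; the only extra cost is one database
read per day on the driver.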
Thanks.
Tian



spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2014-09-19 Thread tian zhang


Hi, Spark experts,

I have the following issue when using the AWS Java SDK in my Spark
application. I narrowed it down to the following steps to reproduce the
problem:

1) I have Spark 1.1.0 (with Hadoop 2.4) installed on a 3-node cluster.
2) From the master node, I did the following steps:
spark-shell --jars aws-java-sdk-1.7.2.jar
import com.amazonaws.{Protocol, ClientConfiguration}
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
val clientConfiguration = new ClientConfiguration()
val s3accessKey="X"
val s3secretKey="Y"
val credentials = new BasicAWSCredentials(s3accessKey,s3secretKey)
println("CLASSPATH="+System.getenv("CLASSPATH"))
CLASSPATH=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf
println("java.class.path="+System.getProperty("java.class.path"))
java.class.path=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf

So far everything looks good and normal. But the following step fails, and it
looks like the class loader can't resolve the right class. Any suggestions
for Spark applications that require the AWS SDK?

scala> val s3Client = new AmazonS3Client(credentials, clientConfiguration)
java.lang.NoClassDefFoundError: org/apache/http/impl/conn/PoolingClientConnectionManager
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:334)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21)
at $iwC$$iwC$$iwC.<init>(<console>:26)
at $iwC$$iwC.<init>(<console>:28)
at $iwC.<init>(<console>:30)
at <init>(<console>:32)
at .<init>(<console>:36)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.http.impl.conn.PoolingClientConnectionManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 46 more
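One way to double-check that the class is genuinely absent from the REPL
classpath, rather than loaded from an incompatible jar, is to ask the class
loader directly from the same spark-shell session (getResource returns null
when nothing on the classpath provides the class):

println(getClass.getClassLoader.getResource(
  "org/apache/http/impl/conn/PoolingClientConnectionManager.class"))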
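My current guess at the cause: PoolingClientConnectionManager first appeared
in Apache HttpClient 4.2, while the Spark 1.1.0 assembly on the classpath
above seems to bundle an older httpclient. A workaround worth trying (the
4.2.5 versions are a guess at what aws-java-sdk 1.7.2 declares, not
verified) is to ship a matching httpclient/httpcore pair alongside the SDK
jar:

spark-shell --jars aws-java-sdk-1.7.2.jar,httpclient-4.2.5.jar,httpcore-4.2.5.jar

If classes from the Spark assembly still win the classloader race, building
the application as an assembly jar that shades org.apache.http would be the
more robust fix.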

Thanks.

Tian