Re: java.io.FileNotFoundException

2016-06-04 Thread kishore kumar
Hi, Could anyone help me with this error? Why does this error occur? Thanks, KishoreKuamr. On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar wrote: > Hi Jeff Zhang, > > Thanks for the response. Could you explain why this error occurs? > > On Fri, Jun 3, 2016 at 6:15 PM, Jeff

Re: ImportError: No module named numpy

2016-06-04 Thread Daniel Rodriguez
As others have said, you need numpy on all the nodes of the cluster. The easiest way, in my opinion, is to use Anaconda: https://www.continuum.io/downloads but that can get tricky to manage across multiple nodes if you don't have some configuration management skills. How are you deploying the spark
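
A quick way to see whether a given Python interpreter can find numpy is a check along these lines. This is only a sketch: the helper names `module_available` and `describe_environment` are made up for illustration, and it only inspects the interpreter it runs under, so it would have to be run on each node (or with whichever interpreter the workers actually use) to say anything about the cluster.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    # find_spec looks the module up without importing it.
    return importlib.util.find_spec(name) is not None


def describe_environment(name: str = "numpy") -> str:
    """Report which interpreter is running and whether it can find `name`."""
    status = "available" if module_available(name) else "MISSING"
    return f"{sys.executable}: {name} is {status}"


if __name__ == "__main__":
    print(describe_environment())
```

If the driver reports numpy as available while a worker node reports it as missing, the two are most likely using different Python installations.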

Re: ImportError: No module named numpy

2016-06-04 Thread Gourav Sengupta
Hi, I think that solution is too simple. Just download Anaconda (if you pay for the licensed version you will eventually feel like you are in heaven when you move to CI and CD and live in a world where you have a data product actually running in real life). Then start the pyspark program by

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-04 Thread Mich Talebzadeh
Hi, Spark works in local, standalone and yarn-client mode. Start with master = local. That is the simplest mode. You do NOT need to start $SPARK_HOME/sbin/start-master.sh and $SPARK_HOME/sbin/start-slaves.sh. Also, you do not need to specify all that in spark-submit. In the Scala code you can do val

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-04 Thread Alonso Isidoro Roman
Hi David, but removing the setMaster line produces this error: org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.<init>(SparkContext.scala:402) at example.spark.AmazonKafkaConnector$.main(AmazonKafkaConnectorWithMongo.scala:93) at

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-04 Thread Cosmin Ciobanu
Microbatch is 20 seconds. We’re not using window operations. The graphs are for a test cluster, and the entire load is artificially generated by load tests (100k / 200k generated sessions). We’ve performed a few more performance tests. On the same 5 node cluster, with the same application: ·