identifying newly arrived files in s3 in spark streaming

2016-06-06 Thread pandees waran
I am fairly new to Spark Streaming and I have a basic question on how Spark Streaming works on an S3 bucket which periodically gets new files every 10 mins. When I use Spark Streaming to process these files in this S3 path, will it process all the files in this path (old+new files) every
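[Editor's note: Spark Streaming's file-based sources only pick up files that appear after the streaming context starts (fileStream also takes a newFilesOnly flag to change that). A minimal sketch, assuming an s3n:// path and a 10-minute batch interval to match the question; the bucket and path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("s3-file-stream")
    val ssc = new StreamingContext(conf, Seconds(600))

    // textFileStream monitors the directory; each batch contains only
    // files created since the previous batch, not pre-existing ones.
    val lines = ssc.textFileStream("s3n://my-bucket/incoming/")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
]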

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Marcelo Vanzin
There should be a spark-defaults.conf file somewhere on your machine; that's where the config is. You can try to change it, but if you're using some tool to manage configuration for you, your changes might end up being overwritten, so be careful with that. You can also try "--properties-file
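[Editor's note: a sketch of the usual places to look, plus the --properties-file escape hatch Marcelo refers to; the paths and app details are assumptions and vary by install:

    # Typical locations:
    #   $SPARK_HOME/conf/spark-defaults.conf
    #   /etc/spark/conf/spark-defaults.conf   (common on managed HDP/CDH clusters)

    # Override the managed config without editing it:
    spark-submit --properties-file /home/me/my-spark.conf \
      --class com.example.App app.jar
]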

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread verylucky...@gmail.com
Thank you Marcelo. I don't know how to remove it. Could you please tell me how I can remove that configuration? On Mon, Jun 6, 2016 at 5:04 PM, Marcelo Vanzin wrote: > This sounds like your default Spark configuration has an > "enabledAlgorithms" config in the SSL

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Marcelo Vanzin
This sounds like your default Spark configuration has an "enabledAlgorithms" config in the SSL settings, and that is listing an algorithm name that is not available in JDK 8. Either remove that configuration (to use the JDK's default algorithm list), or change it so that it lists algorithms
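[Editor's note: per the Spark security docs the key in question is spark.ssl.enabledAlgorithms; the offending entry in spark-defaults.conf would look roughly like this (cipher-suite names are illustrative, though these two do ship with JDK 8):

    # Either delete this line entirely (fall back to the JDK defaults),
    # or keep only cipher suites your JDK actually provides:
    spark.ssl.enabledAlgorithms TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
]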

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread verylucky...@gmail.com
Thank you Ted for the reference. I am going through it in detail. Thank you Marco for your suggestion. I created a properties file with these two lines:

    spark.driver.extraJavaOptions -Djsse.enableSNIExtension=false
    spark.executor.extraJavaOptions -Djsse.enableSNIExtension=false

and gave this

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Marco Mistroni
Hi, have you tried adding this flag? -Djsse.enableSNIExtension=false I had similar issues in another standalone application when I switched to Java 8 from Java 7. HTH, Marco. On Mon, Jun 6, 2016 at 9:58 PM, Koert Kuipers wrote: > mhh i would not be very happy if the implication

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Koert Kuipers
Mhh, I would not be very happy if the implication is that I have to start maintaining separate Spark builds for client clusters that use Java 8... On Mon, Jun 6, 2016 at 4:34 PM, Ted Yu wrote: > Please see: > https://spark.apache.org/docs/latest/security.html > > w.r.t. Java

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Ted Yu
Please see: https://spark.apache.org/docs/latest/security.html W.r.t. Java 8, you probably need to rebuild 1.5.2 using Java 8. Cheers On Mon, Jun 6, 2016 at 1:19 PM, verylucky...@gmail.com < verylucky...@gmail.com> wrote: > Thank you for your response. > > I have seen this and couple of other
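[Editor's note: a sketch of the rebuild Ted suggests, assuming a Spark 1.5.2 source checkout; the JDK path and Hadoop profile are placeholders and should match the cluster:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk   # path is an assumption
    cd spark-1.5.2
    ./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -DskipTests
]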

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread verylucky...@gmail.com
Thank you for your response. I have seen this and a couple of other similar ones about Java SSL in general. However, I am not sure how it applies to Spark and specifically to my case. The error I mention above occurs when I switch from Java 7 to Java 8 by changing the env variable JAVA_HOME. The

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Ted Yu
Have you seen this? http://stackoverflow.com/questions/22423063/java-exception-on-sslsocket-creation On Mon, Jun 6, 2016 at 12:31 PM, verylucky Man wrote: > Hi, > > I have a cluster (Hortonworks supported system) running Apache spark on > 1.5.2 on Java 7, installed by

Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread verylucky Man
Hi, I have a cluster (a Hortonworks-supported system) running Apache Spark 1.5.2 on Java 7, installed by admin. Java 8 is also installed. I don't have admin access to this cluster and would like to run Spark (1.5.2 and later versions) on Java 8. I come from an HPC/MPI background. So I naively

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
I'm not an expert on YARN so anyone please correct me if I'm wrong, but I believe the ResourceManager will schedule the application master to run on any node that has a NodeManager, depending on available resources. So you would normally query the RM via the REST API to determine that.
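[Editor's note: the RM query Bryan mentions can be a plain HTTP call against the ResourceManager's REST API; the host, port, and application id below are placeholders:

    # Ask the ResourceManager about a specific application:
    curl "http://rm-host:8088/ws/v1/cluster/apps/application_1465250000000_0001"
    # The JSON response includes an "amHostHttpAddress" field naming the
    # node that hosts the application master.
]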

Re: groupByKey returns an emptyRDD

2016-06-06 Thread Ted Yu
Can you give us a bit more information?
- how you packaged the code into a jar
- the command you used for execution
- the version of Spark
- a related log snippet
Thanks On Mon, Jun 6, 2016 at 10:43 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm wrapped the following code into a jar: > >

groupByKey returns an emptyRDD

2016-06-06 Thread Daniel Haviv
Hi, I've wrapped the following code into a jar:

    val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)")))
    val agg = test.groupByKey()
    agg.collect.foreach(r => { println(r._1) })

The result of groupByKey is an empty RDD; when I try the same code using the spark-shell it's
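[Editor's note: for anyone trying to reproduce, a self-contained version of the snippet; the app boilerplate and local master are additions so the jar runs standalone, while the original report presumably ran against a cluster. On a healthy setup this prints both keys:

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("GroupByKeyTest").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)")))
        val agg = test.groupByKey()
        agg.collect.foreach(r => println(r._1)) // expected: daniel, test
        sc.stop()
      }
    }
]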

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Michael Armbrust
That kind of stuff is likely fixed in 2.0. If you can get a reproduction working there it would be very helpful if you could open a JIRA. On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher wrote: > A quick unit test attempt didn't get far replacing map with as[], I'm

Re: Unable to set ContextClassLoader in spark shell

2016-06-06 Thread Marcelo Vanzin
On Mon, Jun 6, 2016 at 4:22 AM, shengzhixia wrote: > In my previous Java project I can change class loader without problem. Could > I know why the above method couldn't change class loader in spark shell? > Any way I can achieve it? The spark-shell for Scala 2.10 will

Re: Specify node where driver should run

2016-06-06 Thread Saiph Kappa
How can I specify the node where the application master should run in the yarn conf? I haven't found any useful information regarding that. Thanks. On Mon, Jun 6, 2016 at 4:52 PM, Bryan Cutler wrote: > In that mode, it will run on the application master, whichever node that > is
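[Editor's note: one option on Spark 1.6+ (not the yarn conf itself) is YARN node labels, which can restrict where the AM, and hence the cluster-mode driver, is scheduled. The label name below is a placeholder, and the cluster must be running YARN 2.6+ with node labels configured first:

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.am.nodeLabelExpression="driver-nodes" \
      --class com.example.App app.jar
]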

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
In that mode, the driver runs in the application master, on whichever node that is, as determined by your yarn conf. On Jun 5, 2016 4:54 PM, "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is there any way to specify on which node I want the > driver to run? > > Thanks. >

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
Hi, just to update the thread: I have just submitted a simple wordcount job on YARN with this command:

    [cloudera@quickstart simple-word-count]$ spark-submit --class com.example.Hello \
      --master yarn --deploy-mode cluster --driver-memory 1024Mb \
      --executor-memory 1G --executor-cores 1

subscribe

2016-06-06 Thread Kishorkumar Patil

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Richard Marscher
A quick unit test attempt didn't get far replacing map with as[]; I'm only working against 1.6.1 at the moment, though. I was going to try 2.0, but I'm having a hard time building a working spark-sql jar from source; the only ones I've managed to make are intended for the full assembly fat jar.
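[Editor's note: the "Building Spark" docs describe building a single submodule from the source tree, which may produce the jar needed here; the exact invocation below is a sketch for a 2.0-era checkout:

    # From the root of the Spark source tree; -am also builds the
    # upstream modules spark-sql depends on.
    ./build/mvn -pl :spark-sql_2.11 -am -DskipTests clean install
]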

Logging from transformations in PySpark

2016-06-06 Thread Michael Ravits
Hi, I'd like to send some performance metrics from some of the transformations to StatsD. I understood that I should create a new connection to StatsD from each transformation, which I'm afraid would harm performance. I've also read that there is a workaround for this in Scala by defining an
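[Editor's note: the workaround the truncated sentence is presumably alluding to is a lazily initialized singleton object, so each executor JVM opens one connection rather than one per record. A sketch in Scala; the host and metric names are placeholders, and raw UDP stands in for a real StatsD client library:

    import java.net.{DatagramPacket, DatagramSocket, InetAddress}
    import org.apache.spark.rdd.RDD

    // Initialized on first use inside a task on each executor,
    // never serialized from the driver.
    object StatsD {
      private val host = InetAddress.getByName("statsd.example.com") // placeholder
      private lazy val socket = new DatagramSocket()
      def send(metric: String): Unit = {
        val bytes = metric.getBytes("UTF-8")
        socket.send(new DatagramPacket(bytes, bytes.length, host, 8125))
      }
    }

    def instrumented(rdd: RDD[String]): RDD[String] =
      rdd.map { x =>
        StatsD.send("myapp.records:1|c") // StatsD counter increment per record
        x
      }
]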

Re: Spark SQL - Encoders - case class

2016-06-06 Thread Dave Maughan
Hi, Thanks for the quick replies. I've tried those suggestions but Eclipse is showing: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other

Re: Spark SQL - Encoders - case class

2016-06-06 Thread Han JU
Hi, I think encoders for case classes are already provided in Spark. You'll just need to import them:

    val sql = new SQLContext(sc)
    import sql.implicits._

And then do the cast to Dataset. 2016-06-06 14:13 GMT+02:00 Dave Maughan : > Hi, > > I've figured out how
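[Editor's note: putting Han JU's pieces together, a minimal sketch of the Scala-side equivalent of the Encoders.bean version in the question below. The Table1 case class and column name are taken from that question; the original used TestHive, while a plain SQLContext is shown here, and the case class must be defined at the top level, not inside another class:

    import org.apache.spark.sql.SQLContext

    case class Table1(fooBar: String)

    val sql = new SQLContext(sc)
    import sql.implicits._

    val ds = sql.sql("select foo_bar as `fooBar` from table1").as[Table1]
    ds.show()
]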

Spark SQL - Encoders - case class

2016-06-06 Thread Dave Maughan
Hi, I've figured out how to select data from a remote Hive instance and encode the DataFrame -> Dataset using a Java POJO class:

    TestHive.sql("select foo_bar as `fooBar` from table1").as(Encoders.bean(classOf[Table1])).show()

However, I'm struggling to find out how to do the equivalent in

Re: How to modify collection inside a spark rdd foreach

2016-06-06 Thread Robineast
It's not that clear what you are trying to achieve: what type is myRDD, and where do k and v come from? Anyway, it seems you want to end up with a map or a dictionary, which is what a PairRDD is for, e.g.:

    val rdd = sc.makeRDD(Array("1", "2", "3"))
    val pairRDD = rdd.map(el => (el.toInt, el))
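[Editor's note: a possible continuation of Robineast's example, showing the map-like operations a PairRDD unlocks; the value names are illustrative:

    val grouped = pairRDD.groupByKey()   // RDD[(Int, Iterable[String])]
    val asMap = pairRDD.collectAsMap()   // scala.collection.Map[Int, String], materialized on the driver
]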

Unable to set ContextClassLoader in spark shell

2016-06-06 Thread shengzhixia
Hello guys! I am using the spark shell, which uses TranslatingClassLoader:

    scala> Thread.currentThread().getContextClassLoader
    res13: ClassLoader = org.apache.spark.repl.SparkIMain$TranslatingClassLoader@23c767e6

For some reason I want to use another class loader, but when I do val myclassloader =
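[Editor's note: for context, the generic pattern being attempted probably looks something like the sketch below (the jar path is hypothetical); per Marcelo's reply above, whether the Scala 2.10 REPL honors the swap is a separate question:

    import java.net.{URL, URLClassLoader}

    // Wrap the current loader and swap it in for this thread.
    val myClassLoader = new URLClassLoader(
      Array(new URL("file:///tmp/extra.jar")), // placeholder jar
      Thread.currentThread().getContextClassLoader)
    Thread.currentThread().setContextClassLoader(myClassLoader)
]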

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Mich Talebzadeh
Have you tried master local? That should work. This works as a test:

    ${SPARK_HOME}/bin/spark-submit \
      --driver-memory 2G \
      --num-executors 1 \
      --executor-memory 2G \
      --master local[2] \
      --executor-cores 2 \

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
Hi guys, I finally understand that I cannot use sbt-pack to run the spark-streaming job programmatically as Unix commands; I have to use YARN or Mesos in order to run the jobs. I have some doubts: if I run the Spark Streaming jobs in yarn-client mode, I am receiving this exception:

Re: Switching broadcast mechanism from torrrent

2016-06-06 Thread Daniel Haviv
Hi, I've set spark.broadcast.factory to org.apache.spark.broadcast.HttpBroadcastFactory and it indeed resolved my issue. I'm creating a dataframe which creates a broadcast variable internally and then fails due to the torrent broadcast with the following stacktrace: Caused by:
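[Editor's note: for anyone hitting the same thing on Spark 1.x (HttpBroadcast was removed in 2.0), the switch Daniel describes can be applied per job; the app details are placeholders:

    spark-submit \
      --conf spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory \
      --class com.example.App app.jar

    # or persistently, in spark-defaults.conf:
    # spark.broadcast.factory org.apache.spark.broadcast.HttpBroadcastFactory
]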

Re: Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Marco Mistroni
Hi Ashok, this is not really a Spark-related question, so I would not use this mailing list. Anyway, my 2 cents here: as outlined by earlier replies, if the class you are referencing is in a different jar, at compile time you will need to add that dependency to your build.sbt. I'd personally
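[Editor's note: a minimal build.sbt along the lines Marco describes; every coordinate below is a placeholder except spark-core, whose version matches the one discussed elsewhere in this digest:

    name := "my-spark-app"
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
      // placeholder for the jar that contains your own utility classes:
      "com.example" %% "my-utils" % "0.1.0"
    )
]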

Re: Performance of Spark/MapReduce

2016-06-06 Thread Sean Owen
I don't think that quote is true in general. Given a map-only task, or a map+shuffle+reduce, I'd expect MapReduce to be the same speed or a little faster. It is the simpler, lower-level, narrower mechanism. It's all limited by how fast you can read/write data and execute the user code. There's a

Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Ashok Kumar
Can anyone help me with this, please? On Sunday, 5 June 2016, 11:06, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and