RE: Spark and Kafka integration

2017-01-12 Thread Phadnis, Varun
Cool! Thanks for your input, Jacek and Mark! From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: 13 January 2017 12:59 To: Phadnis, Varun Cc: user@spark.apache.org Subject: Re: Spark and Kafka integration See "API compatibility" in

Re: Re: Re: how to change datatype by using StructType

2017-01-12 Thread lk_spark
A better way to answer my question: use GenericRow instead of Row. val rows: RDD[Row] = spark.sparkContext.textFile("/sourcedata/test/test1").map { line => { val attributes: Array[String] = line.split(",") val ab = ArrayBuffer[Any]() for (i
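
A minimal sketch of the approach described above. Only the input path comes from the thread; the schema and the per-column conversions are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rows-from-text").getOrCreate()

// Hypothetical two-column schema for a comma-separated file
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val rows: RDD[Row] = spark.sparkContext
  .textFile("/sourcedata/test/test1")
  .map { line =>
    val attributes = line.split(",")
    // Row.fromSeq builds a generic row from any Seq[Any],
    // so values can be converted to the target types first
    Row.fromSeq(Seq(attributes(0), attributes(1).trim.toInt))
  }

val df = spark.createDataFrame(rows, schema)
```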

Re: Sporadic ClassNotFoundException with Kryo

2017-01-12 Thread Nirmal Fernando
I faced a similar issue and had to do two things: 1. Submit the Kryo jar with spark-submit; 2. Set spark.executor.userClassPathFirst to true in the Spark conf. On Fri, Nov 18, 2016 at 7:39 PM, chrism wrote: > Regardless of the different ways we have tried deploying a jar
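
A sketch of those two steps; the jar path is hypothetical, and spark.executor.userClassPathFirst is an experimental setting, so test it against your deployment:

```scala
import org.apache.spark.SparkConf

// 1. Ship the jar containing the Kryo-registered classes:
//    spark-submit --jars /path/to/kryo-classes.jar ...   (path is hypothetical)
// 2. Prefer user jars over Spark's own classpath on the executors
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.userClassPathFirst", "true")
```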

Re: Data frame writing

2017-01-12 Thread Rajendra Bhat
Initially there is no directory; the directory is created by the Spark job, and it should be empty during job execution. It seems df.write itself creates the first file and then tries to overwrite it. On Fri, Jan 13, 2017 at 11:42 AM, Amrit Jangid wrote: > Hi Rajendra, > > It says your

Re: Data frame writing

2017-01-12 Thread Amrit Jangid
Hi Rajendra, It says your directory is not empty: s3n://buccketName/cip/daily_date. Try to use a save mode, e.g. df.write.mode(SaveMode.Overwrite).partitionBy("date").format("com.databricks.spark.csv").option("delimiter", "#").option("codec", "
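
The suggestion above written out as a sketch; the output path and options are taken from this thread, and the codec option is omitted because it is truncated in the original:

```scala
import org.apache.spark.sql.SaveMode

// Overwrite replaces existing output instead of failing on a non-empty directory
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("date") // one sub-directory per date value
  .format("com.databricks.spark.csv")
  .option("delimiter", "#")
  .save("s3n://buccketName/cip/daily_date")
```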

Data frame writing

2017-01-12 Thread Rajendra Bhat
Hi team, I am reading N CSV files and writing them out partitioned by date. date is one column; it has an integer value (e.g. 20170101). val df = spark.read .format("com.databricks.spark.csv") .schema(schema) .option("delimiter","#") .option("nullValue","")
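
A sketch completing the truncated read above, assuming schema is a predefined StructType; the load path is hypothetical:

```scala
val df = spark.read
  .format("com.databricks.spark.csv")
  .schema(schema)          // predefined StructType, date column as IntegerType
  .option("delimiter", "#")
  .option("nullValue", "")
  .load("s3n://buccketName/cip/input") // hypothetical input location
```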

Re: Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread Sumona Routh
Yes, I save it to S3 in a different process. It is actually the RandomForestClassificationModel.load method (passed an s3 path) where I run into problems. When you say you load it during map stages, do you mean that you are able to directly load a model from inside of a transformation? When I try

Re: Re: Re: how to change datatype by using StructType

2017-01-12 Thread lk_spark
Thank you Nicholas. If the source data were in CSV format, the CSV reader would work well. 2017-01-13 lk_spark From: Nicholas Hakobian Sent: 2017-01-13 08:35 Subject: Re: Re: Re: how to change datatype by using StructType To: "lk_spark" Cc: "ayan

Re: Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread ayan guha
Hi, Given that training and prediction are two different applications, I typically save model objects to HDFS and load them back during the prediction map stages. Best, Ayan On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh wrote: > Hi all, > I've been working with Spark mllib 2.0.2
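
A sketch of that save/load split, assuming model is a fitted RandomForestClassificationModel; the HDFS path and testDf are hypothetical:

```scala
import org.apache.spark.ml.classification.RandomForestClassificationModel

// Training application: persist the fitted model
model.save("hdfs:///models/rf-model")

// Prediction application: load it back, then use it like any transformer
val loaded = RandomForestClassificationModel.load("hdfs:///models/rf-model")
val scored = loaded.transform(testDf) // testDf is a hypothetical input DataFrame
```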

Any suggestions for dbScan

2017-01-12 Thread shobhit gupta
Hi Everyone, Is there any suggestion for a DBSCAN Scala implementation? My application code is running on Spark 2.0, but any suggestion is fine. -- Regards, Shobhit G

problem detecting FAILED state with SparkLauncher and SparkAppHandle

2017-01-12 Thread adam kramer
Hi All - I'm having an issue detecting a failed Spark application state when using the startApplication method and SparkAppHandle with the SparkLauncher in Spark 2.0.1. Previously I had used a Java Process and waitFor, detecting failure from a non-zero exit code, which worked. But when
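
One way to observe terminal states with this API is a listener registered on the handle; a sketch, with the launcher configuration elided:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val launcher = new SparkLauncher() // app resource, main class, master, etc. elided

val handle = launcher.startApplication(new SparkAppHandle.Listener {
  override def stateChanged(h: SparkAppHandle): Unit = {
    // isFinal covers terminal states such as FINISHED, FAILED, and KILLED
    if (h.getState.isFinal && h.getState == SparkAppHandle.State.FAILED) {
      System.err.println(s"Spark application ${h.getAppId} failed")
    }
  }
  override def infoChanged(h: SparkAppHandle): Unit = ()
})
```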

Re: Re: Re: how to change datatype by using StructType

2017-01-12 Thread Nicholas Hakobian
Have you tried the native CSV reader (in Spark 2) or the Databricks CSV reader (in 1.6)? If your data is in a CSV-like format, it'll load directly into a DataFrame. It's possible you have some rows where types are inconsistent. Nicholas Szandor Hakobian, Ph.D. Senior Data Scientist Rally
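
For the native Spark 2 reader, a sketch; the path and options are hypothetical:

```scala
// Spark 2's built-in CSV source; no external package needed
val df = spark.read
  .option("delimiter", ",")
  .option("inferSchema", "true") // or pass an explicit .schema(...)
  .csv("/sourcedata/test/test1")
```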

Re: spark locality

2017-01-12 Thread Michael Gummelt
If the executor reports a different hostname inside the CNI container, then no, I don't think so. On Thu, Jan 12, 2017 at 2:28 PM, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > So even if I make the Spark executors run on the same nodes as Cassandra > nodes, I am not sure each

Re: spark locality

2017-01-12 Thread vincent gromakowski
So even if I make the Spark executors run on the same nodes as Cassandra nodes, I am not sure each worker will connect to C* nodes on the same Mesos agent? 2017-01-12 21:13 GMT+01:00 Michael Gummelt : > The code in there w/ docs that reference CNI doesn't actually run

Is RAND() in SparkSQL deterministic when used on MySql data sources?

2017-01-12 Thread Gabriele Del Prete
Hi all, We need to use the rand() function in Scala Spark SQL in our application, but we discovered that its behavior was not deterministic; that is, different invocations with the same seed would result in different values. This is documented in some bugs, for example:
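
One commonly suggested mitigation, sketched here as an assumption rather than a guaranteed fix: pin the seed and materialize the column, so the plan is not re-evaluated against the JDBC source with a different row order. Note that caching is best-effort; checkpointing or writing the result out gives stronger guarantees:

```scala
import org.apache.spark.sql.functions.rand

val withRand = df.withColumn("r", rand(42L)).cache() // fixed seed
withRand.count() // force evaluation so the cached values are the ones reused
```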

Re: spark locality

2017-01-12 Thread Michael Gummelt
The code in there w/ docs that reference CNI doesn't actually run when CNI is in effect, and doesn't have anything to do with locality. It's just making Spark work in a no-DNS environment On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > I have

Re: spark locality

2017-01-12 Thread vincent gromakowski
I have found this but I am not sure how it can help... https://github.com/mesosphere/spark-build/blob/a9efef8850976f787956660262f3b77cd636f3f5/conf/spark-env.sh 2017-01-12 20:16 GMT+01:00 Michael Gummelt : > That's a good point. I hadn't considered the locality

Re: Spark and Kafka integration

2017-01-12 Thread Mark Hamstra
See "API compatibility" in http://spark.apache.org/versioning-policy.html While code that is annotated as Experimental is still a good faith effort to provide a stable and useful API, the fact is that we're not yet confident enough that we've got the public API in exactly the form that we want to

Re: spark locality

2017-01-12 Thread Michael Gummelt
That's a good point. I hadn't considered the locality implications of CNI yet. I think tasks are placed based on the hostname reported by the executor, which in a CNI container will be different than the HDFS/Cassandra hostname. I'm not aware of anyone running Spark+CNI in prod yet, either.

Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread Sumona Routh
Hi all, I've been working with Spark mllib 2.0.2 RandomForestClassificationModel. I encountered two frustrating issues and would really appreciate some advice: 1) RandomForestClassificationModel is effectively not serializable (I assume it's referencing something that can't be serialized, since

Re: H2O DataFrame to Spark RDD/DataFrame

2017-01-12 Thread Nicholas Sharkey
Page 33 of the Sparkling Water Booklet: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/SparklingWaterBooklet.pdf df = sqlContext.read.format("h2o").option("key", frame.frame_id).load() df = sqlContext.read.format("h2o").load(frame.frame_id) On Thu, Jan 12, 2017 at 1:17 PM, Md. Rezaul
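
The same read in Scala, as a sketch; the frame key "my_frame" is hypothetical, and the exact option names are assumed from the Sparkling Water docs linked above:

```scala
// assumes the Sparkling Water "h2o" data source is on the classpath
val df = sqlContext.read.format("h2o").option("key", "my_frame").load()
```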

H2O DataFrame to Spark RDD/DataFrame

2017-01-12 Thread Md. Rezaul Karim
Hi there, Is there any way to convert an H2O DataFrame to an equivalent Spark RDD or DataFrame? I found good documentation on "Machine Learning with Sparkling Water: H2O + Spark" here at.

Re: mysql and Spark jdbc

2017-01-12 Thread Jorge Machado
Nice, it worked! Thanks. Jorge Machado www.jmachado.me > On 12 Jan 2017, at 17:46, Asher Krim wrote: > > Have you tried using an alias? You should be able to replace > ("dbtable","sometable") with ("dbtable","SELECT utc_timestamp AS my_timestamp > FROM sometable") > >

Re: [Spark Streaming] NoClassDefFoundError : StateSpec

2017-01-12 Thread Shixiong(Ryan) Zhu
You can find the Spark version of spark-submit in the log. Could you check whether it's consistent? On Thu, Jan 12, 2017 at 7:35 AM Ramkumar Venkataraman < ram.the.m...@gmail.com> wrote: > Spark: 1.6.1 > > I am trying to use the new mapWithState API and I am getting the following > error: > >

Re: Running a spark code using submit job in google cloud platform

2017-01-12 Thread A Shaikh
You may have tested this code against the Spark version on your local machine, which may be different from what's in Google Cloud Storage. You need to select the appropriate Spark version when you submit your job. On 12 January 2017 at 15:51, Anahita Talebi wrote: > Dear

spark rdd map error: too many arguments for unapply pattern, maximum = 22

2017-01-12 Thread Anton Kravchenko
Hi there, When I do an rdd map with more than 22 columns, I get "error: too many arguments for unapply pattern, maximum = 22". scala> val rddRes=rows.map{case Row(col1,..col23) => Row(...)} Is there a known way to get around this issue? P.S. Here is the full traceback:
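
A common workaround is to avoid the Row(...) pattern match entirely and use positional access, which sidesteps Scala's 22-argument unapply limit; a sketch with a hypothetical transformation of column 0:

```scala
import org.apache.spark.sql.Row

val rddRes = rows.map { row =>
  val values = row.toSeq // Seq[Any] of all columns, no 22-column limit
  Row.fromSeq(values.updated(0, "transformed")) // hypothetical per-column change
}
```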

UNSUBSCRIBE

2017-01-12 Thread williamtellme123
From: Harjit Singh [mailto:harjit.si...@deciphernow.com] Sent: Tuesday, April 26, 2016 3:11 PM To: user@spark.apache.org Subject: test

Re: mysql and Spark jdbc

2017-01-12 Thread Asher Krim
Have you tried using an alias? You should be able to replace ("dbtable","sometable") with ("dbtable","SELECT utc_timestamp AS my_timestamp FROM sometable") -- Asher Krim Senior Software Engineer On Thu, Jan 12, 2017 at 10:49 AM, Jorge Machado wrote: > Hi Guys, > > I’m having a
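
A sketch of that suggestion with hypothetical connection details; note that a subquery passed as dbtable must be parenthesized and aliased:

```scala
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/db") // hypothetical database
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "(SELECT utc_timestamp AS my_timestamp FROM sometable) t")
  .load()
```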

Re: [Spark Core] Re-using dataframes with limit() produces unexpected results

2017-01-12 Thread Ant Super
I get the correct result only occasionally. Try more often, or increase the range number; it will break down eventually. I use Spark 2.0.0.2.5 in yarn-client mode. On 12.01.2017 15:13, "Takeshi Yamamuro" wrote: Hi, I got the correct answer. Did I miss something? //

spark locality

2017-01-12 Thread vincent gromakowski
Hi all, Does anyone have experience running Spark on Mesos with CNI (IP per container)? How would Spark use IP or hostname for data locality with a backend framework like HDFS or Cassandra? V

Running a spark code using submit job in google cloud platform

2017-01-12 Thread Anahita Talebi
Dear all, I am trying to run a .jar file as a job using Submit Job in the Google Cloud console. https://cloud.google.com/dataproc/docs/guides/submit-job I ran the Spark code on my local computer to generate the .jar file. Then, in the Arguments field, I give the value of the arguments that I

mysql and Spark jdbc

2017-01-12 Thread Jorge Machado
Hi Guys, I'm having an issue loading data with the JDBC connector. My line of code is: val df = sqlContext.read.format("jdbc").option("url","jdbc:mysql://localhost:3306

Re: How to save spark-ML model in Java?

2017-01-12 Thread Asher Krim
What version of Spark are you on? Although it's cut off, I think your error is with RandomForestClassifier, is that correct? If so, you should upgrade to Spark 2, since I think this class only became writable/readable in Spark 2 ( https://github.com/apache/spark/pull/12118) On Thu, Jan 12, 2017

[Spark Streaming] NoClassDefFoundError : StateSpec

2017-01-12 Thread Ramkumar Venkataraman
Spark: 1.6.1 I am trying to use the new mapWithState API and I am getting the following error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/StateSpec$ Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.StateSpec$ Build.sbt
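
A build.sbt sketch of the usual fix: keep the streaming artifact at exactly the cluster's Spark version and mark it provided, so spark-submit's own jars are used at runtime (the 1.6.1 version shown matches the one mentioned above):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided" // contains StateSpec
)
```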

Support of Theta Join

2017-01-12 Thread Mahender Sarangam
Hi All, Is there any support for theta joins in Spark? We want to identify the country based on IP address ranges (which we have in our DB).
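
Spark SQL does accept non-equi join conditions; a sketch with hypothetical tables logs(ip) and geo(start_ip, end_ip, country), IP addresses encoded as integers:

```scala
val result = spark.sql("""
  SELECT l.ip, g.country
  FROM logs l
  JOIN geo g ON l.ip BETWEEN g.start_ip AND g.end_ip
""")
```

Such joins fall back to a (broadcast) nested-loop strategy, so keeping the range table small enough to broadcast matters for performance.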

failed to launch org.apache.spark.deploy.master.Master

2017-01-12 Thread Soheila S.
Hi, I have executed my Spark job using spark-submit on my local machine and on a cluster. Now I want to try using HDFS: put the data (a text file) on HDFS, read it from there, execute the jar file, and finally write the output to HDFS. I got this error after running the job: failed to launch

Re: [Spark Core] Re-using dataframes with limit() produces unexpected results

2017-01-12 Thread Takeshi Yamamuro
Hi, I got the correct answer. Did I miss something? // maropu --- [Spark shell welcome banner] version 2.0.0 Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)

Spark with oozie #Not implemented by the TFS FileSystem# issue

2017-01-12 Thread Rohit Mishra
Hello, I am new to Spark. I need to run a Spark job within Oozie. Individually I am able to run the Spark job, but with Oozie, after the job is launched I am getting the following error: 2017-01-12 13:51:57,696 INFO [main] org.apache.hadoop.service.AbstractService: Service

Re: Spark and Kafka integration

2017-01-12 Thread Jacek Laskowski
Hi Phadnis, I found this in http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html: > This version of the integration is marked as experimental, so the API is > potentially subject to change. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering

Spark and Kafka integration

2017-01-12 Thread Phadnis, Varun
Hello, We are using Spark 2.0 with Kafka 0.10. As I understand it, much of the API packaged in the following dependency we are targeting is marked as "@Experimental": org.apache.spark spark-streaming-kafka-0-10_2.11 2.0.0 What are the implications of this being marked as experimental?

Re: How to save spark-ML model in Java?

2017-01-12 Thread Md. Rezaul Karim
Hi Malshan, The error says that one (or more) of the pipeline stages is not writable, i.e. it does not support the model write/overwrite operation. Suppose you want to configure an ML pipeline consisting of three stages: tokenizer, hashingTF, and nb: val nb = new
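
A sketch of that three-stage pipeline plus the save call; the column names, trainingDf, and the output path are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val nb = new NaiveBayes()

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, nb))
val model = pipeline.fit(trainingDf) // trainingDf is a hypothetical labeled DataFrame

// Succeeds only if every stage implements MLWritable
model.write.overwrite().save("/tmp/nb-pipeline-model")
```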

Re: Add row IDs column to data frame

2017-01-12 Thread ayan guha
Just in case you are more comfortable with SQL, row_number() over a window should also generate a unique id. On Thu, Jan 12, 2017 at 7:00 PM, akbar501 wrote: > The following are 2 different approaches to adding an id/index to RDDs and > 1 > approach to adding an index to a
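
A sketch of that SQL; the ordering column is hypothetical, and Spark requires an ORDER BY in the window for row_number:

```scala
df.createOrReplaceTempView("t")
val withId = spark.sql(
  "SELECT row_number() OVER (ORDER BY some_col) AS id, t.* FROM t")
```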

Re: Re: Re: how to change datatype by using StructType

2017-01-12 Thread lk_spark
I have tried this: val peopleRDD = spark.sparkContext.textFile("/sourcedata/test/test*") val rowRDD = peopleRDD.map(_.split(",")).map(attributes => { val ab = ArrayBuffer[Any]() for (i <- 0 until schemaType.length) { if
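
A schema-driven variant of that loop, as a sketch assuming schemaType is the target StructType from earlier in the thread:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rowRDD = peopleRDD.map(_.split(",")).map { attributes =>
  // Convert each split string according to the matching schema field
  val converted = schemaType.fields.zip(attributes).map {
    case (StructField(_, IntegerType, _, _), v) => v.trim.toInt
    case (StructField(_, DoubleType, _, _), v)  => v.trim.toDouble
    case (_, v)                                 => v
  }
  Row.fromSeq(converted)
}
```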

How to save a pipeline model in spark-ML(JAVA) ?

2017-01-12 Thread Minudika Malshan
Hi, When I try to save a pipeline model using Spark ML (Java), the following exception is thrown: java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Non-Writable stage: rfc_98f8c9e0bd04 of type class

How to save spark-ML model in Java?

2017-01-12 Thread Minudika Malshan
Hi, When I try to save a pipeline model using Spark ML (Java), the following exception is thrown: java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Non-Writable stage: rfc_98f8c9e0bd04 of type class

Re: Add row IDs column to data frame

2017-01-12 Thread akbar501
The following are 2 different approaches to adding an id/index to RDDs and 1 approach to adding an index to a DataFrame. Add an index column to an RDD ```scala // RDD val dataRDD = sc.textFile("./README.md") // Add index then set index as key in map() transformation // Results in RDD[(Long,
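
Completing the truncated post with a sketch of both flavors; dataRDD comes from the snippet above, while df is a hypothetical DataFrame:

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// RDD: zipWithIndex gives consecutive 0-based ids (costs an extra job);
// zipWithUniqueId is cheaper, but its ids are only unique, not consecutive
val indexed = dataRDD.zipWithIndex().map { case (line, idx) => (idx, line) }

// DataFrame: 64-bit ids, unique but not consecutive, computed per partition
val withId = df.withColumn("id", monotonically_increasing_id())
```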