Re: Add row IDs column to data frame

2017-01-11 Thread akbar501
RDDs, DataFrames, and Datasets are all immutable, so you cannot edit any of them in place. Instead, the approach you should take is to call transformation functions on the RDD/DataFrame/Dataset: RDD transformations return a new RDD, DataFrame transformations return a new DataFrame, and
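For illustration, a minimal sketch of the transformation approach, assuming Spark 2.x and a DataFrame named df (both placeholders here):

  import org.apache.spark.sql.functions.monotonically_increasing_id

  // Returns a new DataFrame with an extra column; df itself is untouched.
  val withId = df.withColumn("row_id", monotonically_increasing_id())

  // The generated IDs are unique and increasing but not consecutive;
  // for consecutive indices, drop down to the RDD API:
  val indexed = df.rdd.zipWithIndex() // RDD[(Row, Long)]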

[Spark Core] Re-using dataframes with limit() produces unexpected results

2017-01-11 Thread Ant
Hello, it seems that using a Spark DataFrame on which limit() has been applied in further calculations produces very unexpected results. Some people use it as a poor man's sampling and end up debugging for hours to see what's going on. Here is an example: from pyspark.sql import Row
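A minimal sketch of the pitfall and a common mitigation (df and the row count are placeholders); note that even cache() is best-effort if blocks get evicted:

  // limit() makes no guarantee about *which* rows are returned, so two
  // actions on the same limited DataFrame can be computed from different rows.
  val sample = df.limit(1000)
  // sample.count() and sample.collect() may disagree on the underlying rows.

  // Materializing the sample makes later actions reuse the same rows:
  val stable = df.limit(1000).cache()
  stable.count() // triggers computation; subsequent actions read the cached rows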

Re: Re: how to change datatype by useing StructType

2017-01-11 Thread lk_spark
yes, the field year is in my data. data: kevin,30,2016 shen,30,2016 kai,33,2016 wei,30,2016 This will not work: val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2))) but I need to read the data configurably. 2017-01-12 lk_spark
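One hedged way to make the typed schema work (a sketch, not from the thread): convert each split field to its configured DataType before building the Row, so the runtime values match the schema.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  // Conversion driven by the configured DataType (extend as needed).
  def convert(value: String, dt: DataType): Any = dt match {
    case IntegerType => value.toInt
    case LongType    => value.toLong
    case DoubleType  => value.toDouble
    case _           => value // StringType and anything unhandled
  }

  val fieldTypes = Seq(StringType, IntegerType, IntegerType) // name, age, year
  val rowRDD = peopleRDD.map(_.split(",")).map { attrs =>
    Row.fromSeq(attrs.zip(fieldTypes).map { case (v, dt) => convert(v, dt) })
  }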

Re: how to change datatype by useing StructType

2017-01-11 Thread ayan guha
Do you have year in your data? On Thu, 12 Jan 2017 at 5:24 pm, lk_spark wrote: > hi, all > I have a txt file, and I want to process it as a dataframe: > data like this: > name1,30 > name2,18

how to change datatype by useing StructType

2017-01-11 Thread lk_spark
hi, all. I have a txt file, and I want to process it as a dataframe. Data like this: name1,30 name2,18 val schemaString = "name age year" val xMap = new scala.collection.mutable.HashMap[String,DataType]() xMap.put("name", StringType) xMap.put("age", IntegerType)
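A possible continuation of this snippet (untested sketch, assuming a SparkSession named spark and a rowRDD whose values already match the declared types): build the StructType from the configured map, then apply it.

  xMap.put("year", IntegerType)
  val fields = schemaString.split(" ")
    .map(name => StructField(name, xMap(name), nullable = true))
  val schema = StructType(fields)

  // createDataFrame checks runtime types against the schema at execution time.
  val df = spark.createDataFrame(rowRDD, schema)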

Re: Time-Series Analysis with Spark

2017-01-11 Thread vaquar khan
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html Regards, Vaquar khan On Wed, Jan 11, 2017 at 10:07 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hello Rishabh, > We have done some forecasting, for time-series, using ARIMA

Handling Input Error in wholeTextFiles

2017-01-11 Thread khwunchai jaengsawang
Hi all, I have a requirement to process multiple splittable gzip files, and the results need to include each individual file name. I came across a problem when loading multiple gzip files using the wholeTextFiles method: some files are corrupted, causing an 'unexpected end of input stream' error, and
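One workaround sketch (not from the thread): read the files as binary, decompress manually, and drop the ones that fail, while keeping each file name; the path is a placeholder.

  import java.util.zip.GZIPInputStream
  import scala.io.Source
  import scala.util.Try

  val texts = sc.binaryFiles("hdfs:///data/*.gz").flatMap { case (name, pds) =>
    Try {
      val in = new GZIPInputStream(pds.open())
      try (name, Source.fromInputStream(in).mkString) finally in.close()
    }.toOption // corrupted files throw, yield None, and are skipped
  }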

Re: Time-Series Analysis with Spark

2017-01-11 Thread Rishabh Bhardwaj
spark-ts currently does not support Seasonal ARIMA. There is an open issue for this: https://github.com/sryza/spark-timeseries/issues/156 On Wed, Jan 11, 2017 at 3:50 PM, Sean Owen wrote: > https://github.com/sryza/spark-timeseries ? > > On Wed, Jan 11, 2017 at 10:11 AM

Give a task more resources

2017-01-11 Thread Pat Ferrel
I have a task that would benefit from more cores, but the standalone scheduler launches it when only a subset of cores is available. I’d rather use all cluster cores on this task. Is there a way to tell the scheduler to finish everything before allocating resources to a task? Like "finish everything
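Not a direct answer, but a sketch of two standalone-mode settings that influence this (values illustrative): minRegisteredResourcesRatio makes the app wait for a fraction of the requested cores before scheduling begins.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.cores.max", "32") // total cores to request for this app
    .set("spark.scheduler.minRegisteredResourcesRatio", "1.0") // wait for all of them
    .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s") // but not forever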

Re: structured streaming polling timeouts

2017-01-11 Thread Shixiong(Ryan) Zhu
No. I think increasing the timeout should work. Spark 2.1.0 changed this timeout to 120 seconds, as we found the default value in 2.0.2 was too small. On Wed, Jan 11, 2017 at 12:01 PM, Timothy Chan wrote: > We're currently using EMR and they are still on Spark 2.0.2. > > Do

Re: structured streaming polling timeouts

2017-01-11 Thread Timothy Chan
We're currently using EMR and they are still on Spark 2.0.2. Do you have any other suggestions for additional parameters to adjust besides "kafkaConsumer.pollTimeoutMs"? On Wed, Jan 11, 2017 at 11:17 AM Shixiong(Ryan) Zhu wrote: > You can increase the timeout using the

RE: Spark UI not coming up in EMR

2017-01-11 Thread Saurabh Malviya (samalviy)
Any clue on this? Jobs are running fine, but I am not able to access the Spark UI in EMR-YARN. Where can I see statistics, like number of events per second and rows processed for streaming, in the log files (if the UI is not working)? -Saurabh From: Saurabh Malviya (samalviy) Sent: Monday, January 09, 2017 10:59

Re: structured streaming polling timeouts

2017-01-11 Thread Shixiong(Ryan) Zhu
You can increase the timeout using the option "kafkaConsumer.pollTimeoutMs". In addition, I would recommend you try Spark 2.1.0 as there are many improvements in Structured Streaming. On Wed, Jan 11, 2017 at 11:05 AM, Timothy Chan wrote: > I'm using Spark 2.0.2 and running
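For reference, a sketch of where that option goes (server and topic are placeholders; the value is illustrative):

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("kafkaConsumer.pollTimeoutMs", "120000") // 2.1.0's new default
    .load()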

structured streaming polling timeouts

2017-01-11 Thread Timothy Chan
I'm using Spark 2.0.2 and running a structured streaming query. When I set startingOffsets to earliest I get the following timeout errors: java.lang.AssertionError: assertion failed: Failed to get records for spark-kafka-source-be89d84c-f6e9-4d2b-b6cd-570942dc7d5d-185814897-executor

Re: Unable to build spark documentation

2017-01-11 Thread Sean Owen
Are you using Java 8? Hyukjin fixed up all the errors due to the much stricter javadoc 8, but it's possible some creep back in because there is no Java 8 test now. On Wed, Jan 11, 2017 at 6:22 PM Krishna Kalyan wrote: > Hello, > I have been trying to build spark

Re: Handling null in dataset

2017-01-11 Thread A Shaikh
I tried the DataFrame option below; not sure what it is for, but it doesn't seem to work. - nullValue: specifies a string that indicates a null value; nulls in the DataFrame will be written as this string. On 11 January 2017 at 17:11, A Shaikh wrote: > > > How does

Unable to build spark documentation

2017-01-11 Thread Krishna Kalyan
Hello, I have been trying to build the Spark documentation, following the instructions from the link below: https://github.com/apache/spark/blob/master/docs/README.md My Jekyll build fails (error in the gist below): https://gist.github.com/krishnakalyan3/d0e38852efe97d7899d737b83b8d8702 and

Re: Shortest path performance in Graphx with Spark

2017-01-11 Thread Irving Duran
Hi Gerard, How are you starting Spark? Are you allocating enough RAM for processing? I think the default is 512mb. Try doing the following and see if it helps (based on the size of your dataset, you might not need all 8gb). $SPARK_HOME/bin/spark-shell \ --master local[4] \
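The command is cut off above; a hedged reconstruction of how it presumably continued, with the 8g taken from the "all 8gb" in the message (adjust to your dataset):

  $SPARK_HOME/bin/spark-shell \
    --master local[4] \
    --driver-memory 8g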

Handling null in dataset

2017-01-11 Thread A Shaikh
How does Spark handle null values? case class AvroSource(name: String, age: Integer, sal: Long, col_float: Float, col_double: Double, col_bytes: String, col_bool: Boolean) val userDS = spark.read.format("com.databricks.spark.avro").option("nullValue",
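A common pattern for null-tolerant Datasets (a sketch, not a confirmed fix for this thread): declare nullable columns as Option[...] so a null decodes to None instead of failing on a primitive field.

  case class AvroSource(
    name: String,
    age: Option[Int],
    sal: Option[Long],
    col_float: Option[Float],
    col_double: Option[Double],
    col_bytes: Option[String],
    col_bool: Option[Boolean]
  )
  // val userDS = spark.read.format("com.databricks.spark.avro").load(path).as[AvroSource]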

Re: Time-Series Analysis with Spark

2017-01-11 Thread Dirceu Semighini Filho
Hello Rishabh, We have done some time-series forecasting using ARIMA in our project; it's on top of Spark and it's open source: https://github.com/eleflow/uberdata Kind Regards, Dirceu 2017-01-11 8:20 GMT-02:00 Sean Owen : > https://github.com/sryza/spark-timeseries ? >

Re: Nested ifs in sparksql

2017-01-11 Thread Raghavendra Pandey
I am not using case when; it is mostly IF. By slow, I mean 6 minutes even for 10 records with 41 levels of nested ifs. On Jan 11, 2017 3:31 PM, "Georg Heiler" wrote: > I was using the dataframe api not sql. The main problem was that too much > code was generated. > Using an

Re: Time-Series Analysis with Spark

2017-01-11 Thread Sean Owen
https://github.com/sryza/spark-timeseries ? On Wed, Jan 11, 2017 at 10:11 AM Rishabh Bhardwaj wrote: > Hi All, > > I am exploring time-series forecasting with Spark. > I have some questions regarding this: > > 1. Is there any library/package out there in community of

Time-Series Analysis with Spark

2017-01-11 Thread Rishabh Bhardwaj
Hi All, I am exploring time-series forecasting with Spark. I have some questions regarding this: 1. Is there any library/package out there in the community with a *Seasonal ARIMA* implementation in Spark? 2. Is there any implementation of the Dynamic Linear Model (*DLM*) on Spark? 3. What are the

Re: Nested ifs in sparksql

2017-01-11 Thread Georg Heiler
I was using the DataFrame API, not SQL. The main problem was that too much code was generated. Using a UDF turned out to be quicker as well. Olivier Girardot wrote on Tue, 10 Jan 2017 at 21:54: > Are you using the "case when" functions ? what do you
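A sketch of the idea (df and the column are hypothetical): moving the branching into a UDF keeps Catalyst from generating dozens of levels of nested expression code.

  import org.apache.spark.sql.functions.{col, udf}

  // Plain Scala branches, however deeply nested, compile once as a function.
  val bucket = udf { (x: Int) =>
    if (x < 10) "low"
    else if (x < 100) "mid"
    else "high"
  }
  val result = df.withColumn("bucket", bucket(col("value")))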

Kryo and Spark 1.6.0 - Does it require a default empty constructor?

2017-01-11 Thread Enrico DUrso
Hi, I have a question about Kryo and Spark 1.6.0. I read that to use Kryo, the class you want to serialize must have a default constructor. I created a simple class without such a constructor, and if I try to serialize it manually, it does not work. But if I use that class in Spark
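For context, the usual Spark-side setup (the class is illustrative). A likely explanation for the difference, stated as an assumption: Spark configures Kryo through Twitter's chill library, which can fall back to Objenesis and instantiate objects without a no-arg constructor, whereas a hand-configured Kryo by default cannot.

  import org.apache.spark.SparkConf

  class Payload(val id: Int) // note: no default constructor

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[Payload]))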

RE: Kryo On Spark 1.6.0 [Solution in this email]

2017-01-11 Thread Enrico DUrso
Yes, sure, you can find it here: http://stackoverflow.com/questions/34736587/kryo-serializer-causing-exception-on-underlying-scala-class-wrappedarray Hope it works; I did not try it, as I am using Java. To be precise, I found the solution for my problem. To sum up, I had problems in registering the
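For Scala users hitting the same thing, the registration from the linked answer typically looks like this (a sketch; the extra Array[String] entry is illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .registerKryoClasses(Array(
      classOf[scala.collection.mutable.WrappedArray.ofRef[_]],
      classOf[Array[String]]
    ))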

Book recommendations

2017-01-11 Thread Esa Heikkinen
Hi, Does anyone know of good books about event processing in distributed event systems (like IoT systems)? I have already read the book "The Power of Events" (Luckham 2002), but do newer ones exist? Best Regards, Esa Heikkinen