Re: Can Spark support exactly-once with Kafka, given the following questions?

2016-12-04 Thread Michal Šenkýř
Hello John, 1. If a task completes the operation, it will notify the driver. The driver may not receive the message due to the network and will think the task is still running. Then the child stage won't be scheduled? Spark's fault tolerance policy is, if there is a problem in processing a task or a
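For reference, the usual pattern these exactly-once discussions center on is the Kafka direct stream: track each batch's offset ranges and commit them only after the output has succeeded, with exactly-once then requiring the output itself to be idempotent or transactional. A minimal Scala sketch against the spark-streaming-kafka-0-10 API (broker, topic, and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// ssc: an existing StreamingContext (assumed to be in scope)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",             // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                    // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch idempotently or transactionally here ...
  // Commit offsets only after the output succeeded, so a failed batch is replayed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}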

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-04 Thread Marco Mistroni
Hi In Python you can use datetime.fromtimestamp(..).strftime('%Y-%m-%d') Which Spark API are you using? Kr On 5 Dec 2016 7:38 am, "Devi P.V" wrote: > Hi all, > > I have a dataframe like following, > > ++---+ > |client_id
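In Spark SQL (Scala) the same conversion can be done with from_unixtime; a minimal sketch, assuming df is the dataframe from the question and the timestamp column holds epoch milliseconds as in the sample data:

import org.apache.spark.sql.functions.{col, from_unixtime}

// from_unixtime expects seconds, so divide the millisecond values by 1000.
val withDate = df.withColumn("date", from_unixtime(col("timestamp") / 1000, "yyyy-MM-dd"))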

How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-04 Thread Devi P.V
Hi all, I have a dataframe like the following,

+--------------------------+-------------+
|client_id                 |timestamp    |
+--------------------------+-------------+
|cd646551-fceb-4166-acbc-b9|1477989416803|
|3bc61951-0f49-43bf-9848-b2|1477983725292|
|688a

Re: Access multiple cluster

2016-12-04 Thread ayan guha
Thank you guys. I will try the JDBC route if I get access and let you know. On Mon, Dec 5, 2016 at 5:17 PM, Jörn Franke wrote: > If you do it frequently then you may simply copy the data to the > processing cluster. Alternatively, you could create an external table in > the processing cluster to the

Re: Access multiple cluster

2016-12-04 Thread Jörn Franke
If you do it frequently then you may simply copy the data to the processing cluster. Alternatively, you could create an external table in the processing cluster that points to the analytics cluster. However, this has to be supported by appropriate security configuration and might be less efficient than c
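A short sketch of the external-table route (table name, schema, and HDFS path are placeholders), issued from the processing cluster:

// spark: a SparkSession built with enableHiveSupport()
// The table's metadata lives in the processing cluster's metastore,
// but its data stays on the analytics cluster's HDFS.
spark.sql("""
  CREATE EXTERNAL TABLE analytics_events (client_id STRING, ts BIGINT)
  STORED AS PARQUET
  LOCATION 'hdfs://analytics-namenode:8020/warehouse/events'
""")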

how to use Dataset of transform method

2016-12-04 Thread LQ
How to write this in Java?

public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)

Concise syntax for chaining custom transformations.

def featurize(ds: Dataset[T]): Dataset[U] = ...
ds.transform(featurize).transform(...)

Parameters: t - (undocumented)
Returns: (undocumented)
Since
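For context, a runnable Scala sketch of the chaining pattern the javadoc hints at (the column names and transformations are made up). From Java on Scala 2.11 you cannot pass a lambda directly, since scala.Function1 is not a Java functional interface there; the usual workaround is to extend scala.runtime.AbstractFunction1 and override apply.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// spark: an existing SparkSession (assumed to be in scope)
import spark.implicits._

// Two small custom transformations to chain (hypothetical logic).
def onlyPositive(df: DataFrame): DataFrame = df.filter(col("value") > 0)
def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", col("value") * 2)

val ds = Seq(-1L, 2L, 3L).toDF("value")
ds.transform(onlyPositive).transform(withDoubled).show()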

Can Spark support exactly-once with Kafka, given the following questions?

2016-12-04 Thread John Fang
1. If a task completes the operation, it will notify the driver. The driver may not receive the message due to the network and will think the task is still running. Then the child stage won't be scheduled? 2. How does Spark guarantee that the downstream task receives the shuffle data completely? In fact, I

Re: SVM regression in Spark

2016-12-04 Thread Evgenii Morozov
I don’t think there is such an algorithm. Originally SVM is for classification; there is a tweaked version that does regression, but unfortunately that is not available in Apache Spark, AFAIK. > On 01 Dec 2016, at 02:53, roni wrote: > > Hi Spark expert, > Can anyone help for doing SVR (Suppor

Re: Spark shuffle: FileNotFound exception

2016-12-04 Thread Evgenii Morozov
Swapnil, What do you think might be the size of the file that’s not found? For Spark versions below 2.0.0 there may be issues with shuffle blocks of 2 GB or larger. Is the file actually on the file system? I’d try increasing the default parallelism to make sure partitions get smaller. Hope this helps. > On 04
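A minimal sketch of the parallelism knobs (the value 400 is a placeholder to tune; both settings must be in place before the context/session is created):

import org.apache.spark.SparkConf

// Raise parallelism so each shuffle block stays well under the 2 GB limit.
val conf = new SparkConf()
  .set("spark.default.parallelism", "400")      // RDD shuffles
  .set("spark.sql.shuffle.partitions", "400")   // DataFrame/SQL shuffles

// Or repartition explicitly before the shuffle-heavy stage
// (rdd: the input RDD of the failing job, assumed to be in scope):
val finer = rdd.repartition(400)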

Re: Access multiple cluster

2016-12-04 Thread Mich Talebzadeh
The only way I can think of would be accessing the Hive tables through their respective thrift servers running on the different clusters, but I am not sure you can do it within Spark. Basically two different JDBC connections. HTH, Dr Mich Talebzadeh
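A sketch of the two-connection idea (hosts, table names, and the join key are placeholders; note that Spark's JDBC source combined with the Hive JDBC driver has known rough edges, e.g. column names coming back prefixed with the table name):

// spark: an existing SparkSession (assumed to be in scope)
val analytics = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://analytics-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "events")
  .load()

val processing = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://processing-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "clients")
  .load()

// Join across the two clusters, then write back wherever needed.
val joined = processing.join(analytics, Seq("client_id"))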

Access multiple cluster

2016-12-04 Thread ayan guha
Hi, Is it possible to access Hive tables sitting on multiple clusters in a single Spark application? We have a data processing cluster and an analytics cluster. I want to join a table from the analytics cluster with another table in the processing cluster and finally write back to the analytics cluster. Best Ay

Re: Design patterns for Spark implementation

2016-12-04 Thread Pradeep Gaddam
I was hoping for someone to answer this question, as it resonates with many developers who are new to Spark and trying to adopt it at their work. Regards, Pradeep On Dec 3, 2016, at 9:00 AM, Vasu Gourabathina <vgour...@gmail.com> wrote: Hi, I know this is a broad question. If this is no

RE: Creating schema from json representation

2016-12-04 Thread Mendelson, Assaf
Answering my own question (for those who are interested):

import org.apache.spark.sql.types.{DataType, StructType}

val schema = df.schema        // the dataframe's StructType
val jsonString = schema.json  // JSON string representation
val backToSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]

Creating schema from json representation

2016-12-04 Thread Mendelson, Assaf
Hi, I am trying to save a Spark dataframe schema in Scala. I can do df.schema.json to get the JSON string representation. Now I want to get the schema back from the JSON. However, it seems I need to parse the JSON string myself, get its fields object, and generate the fields manually. Is there a b

Re: RDD getPartitions() size and HashPartitioner numPartitions

2016-12-04 Thread Manish Malhotra
It's a pretty nice question! I'll try to understand the problem and see if I can help further. When you say CustomRDD, I believe you will be using it in the transformation stage, once the data is read from a source like HDFS, Cassandra, or Kafka. Now RDD.getPartitions() should return the pa
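For reference, a minimal custom RDD sketch showing the contract (all names are made up): getPartitions returns one Partition per index, its length is the RDD's partition count, and a shuffle through HashPartitioner(n) produces an RDD whose own getPartitions has length n, independent of the parent's count.

import org.apache.spark.{HashPartitioner, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each entry of getPartitions carries its own index; Spark schedules one
// task per entry, so the array length IS the RDD's partition count.
class CustomRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index)  // trivial payload: the partition index
}

// sc: an existing SparkContext (assumed to be in scope).
// The partitioner's numPartitions, not the parent's 8, decides the output.
val pairs = new CustomRDD(sc, 8).map(i => (i, i))
val reshuffled = pairs.partitionBy(new HashPartitioner(4))
assert(reshuffled.getNumPartitions == 4)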